diff mbox

[RFC/RFT,v2,1/3] arm/arm64: pageattr: add set_memory_nc

Message ID 1431516714-25816-2-git-send-email-drjones@redhat.com
State New
Headers show

Commit Message

Andrew Jones May 13, 2015, 11:31 a.m. UTC
Provide a method to change normal, cacheable memory to non-cacheable.
KVM will make use of this to keep emulated device memory regions
coherent with the guest.

Signed-off-by: Andrew Jones <drjones@redhat.com>
---
 arch/arm/include/asm/cacheflush.h   | 1 +
 arch/arm/mm/pageattr.c              | 7 +++++++
 arch/arm64/include/asm/cacheflush.h | 1 +
 arch/arm64/mm/pageattr.c            | 8 ++++++++
 4 files changed, 17 insertions(+)

Comments

Christoffer Dall May 14, 2015, 11:05 a.m. UTC | #1
On Wed, May 13, 2015 at 01:31:52PM +0200, Andrew Jones wrote:
> Provide a method to change normal, cacheable memory to non-cacheable.
> KVM will make use of this to keep emulated device memory regions
> coherent with the guest.
> 
> Signed-off-by: Andrew Jones <drjones@redhat.com>

Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>

But you obviously need Russell and Will/Catalin to ack/merge this.

-Christoffer
Andrew Jones May 14, 2015, 1:46 p.m. UTC | #2
On Thu, May 14, 2015 at 01:05:09PM +0200, Christoffer Dall wrote:
> On Wed, May 13, 2015 at 01:31:52PM +0200, Andrew Jones wrote:
> > Provide a method to change normal, cacheable memory to non-cacheable.
> > KVM will make use of this to keep emulated device memory regions
> > coherent with the guest.
> > 
> > Signed-off-by: Andrew Jones <drjones@redhat.com>
> 
> Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
> 
> But you obviously need Russell and Will/Catalin to ack/merge this.

I guess this patch is going to go away in the next round. You've
pointed out that I screwed stuff up royally with my over eagerness
to reuse code. I need to reimplement change_memory_common, but a
version that takes an mm, which is more or less what I did in the
last version of this series, back when I was pinning pages.

drew
Christoffer Dall May 15, 2015, 2:51 p.m. UTC | #3
On Thu, May 14, 2015 at 03:46:44PM +0200, Andrew Jones wrote:
> On Thu, May 14, 2015 at 01:05:09PM +0200, Christoffer Dall wrote:
> > On Wed, May 13, 2015 at 01:31:52PM +0200, Andrew Jones wrote:
> > > Provide a method to change normal, cacheable memory to non-cacheable.
> > > KVM will make use of this to keep emulated device memory regions
> > > coherent with the guest.
> > > 
> > > Signed-off-by: Andrew Jones <drjones@redhat.com>
> > 
> > Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
> > 
> > But you obviously need Russell and Will/Catalin to ack/merge this.
> 
> I guess this patch is going to go away in the next round. You've
> pointed out that I screwed stuff up royally with my over eagerness
> to reuse code. I need to reimplement change_memory_common, but a
> version that takes an mm, which is more or less what I did in the
> last version of this series, back when I was pinning pages.
> 
Yeah, I just read this one before looking at the others because it was a
simple one...

-Christoffer
Catalin Marinas May 18, 2015, 3:53 p.m. UTC | #4
On Thu, May 14, 2015 at 02:46:44PM +0100, Andrew Jones wrote:
> On Thu, May 14, 2015 at 01:05:09PM +0200, Christoffer Dall wrote:
> > On Wed, May 13, 2015 at 01:31:52PM +0200, Andrew Jones wrote:
> > > Provide a method to change normal, cacheable memory to non-cacheable.
> > > KVM will make use of this to keep emulated device memory regions
> > > coherent with the guest.
> > > 
> > > Signed-off-by: Andrew Jones <drjones@redhat.com>
> > 
> > Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
> > 
> > But you obviously need Russell and Will/Catalin to ack/merge this.
> 
> I guess this patch is going to go away in the next round. You've
> pointed out that I screwed stuff up royally with my over eagerness
> to reuse code. I need to reimplement change_memory_common, but a
> version that takes an mm, which is more or less what I did in the
> last version of this series, back when I was pinning pages.

I kept wondering what this patch and the next one are doing with
set_memory_nc(). Basically you were trying to set the cache attributes
for the kernel linear mapping or kmap address (in the 32-bit arm case,
which is removed anyway on kunmap).

What you need is changing the attributes of the user mapping as accessed
by Qemu but I don't think simply re-implementing change_memory_common()
would work, you probably need to pin the pages in memory as well.
Otherwise, the kernel may remove such pages and, when bringing them
back, would set the default cacheability attributes.

Another way would be to split the vma containing the non-cacheable
memory so that you get a single vma with the vm_page_prot as
Non-cacheable.

Yet another approach could be for KVM to mmap the necessary memory for
Qemu via a file_operations.mmap call (but that's only for ranges outside
the guest "RAM").

I didn't have time to follow these threads in details, but just to
recap my understanding, we have two main use-cases:

1. Qemu handling guest I/O to device (e.g. PCIe BARs)
2. Qemu emulating device DMA

For (1), I guess Qemu uses an anonymous mmap() and then tells KVM about
this memory slot. The memory attributes in this case could be Device
because that's how the guest would normally map it. The
file_operations.mmap trick would work in this case but this means
expanding the KVM ABI beyond just an ioctl().

For (2), since Qemu is writing to the guest "RAM" (e.g. video
framebuffer allocated by the guest), I still think the simplest is to
tell the guest (via DT) that such device is cache coherent rather than
trying to remap the Qemu mapping as non-cacheable.
Andrew Jones May 19, 2015, 10:03 a.m. UTC | #5
Hi Catalin,

Thanks for the feedback. Some comments to your comments below.

On Mon, May 18, 2015 at 04:53:03PM +0100, Catalin Marinas wrote:
> On Thu, May 14, 2015 at 02:46:44PM +0100, Andrew Jones wrote:
> > On Thu, May 14, 2015 at 01:05:09PM +0200, Christoffer Dall wrote:
> > > On Wed, May 13, 2015 at 01:31:52PM +0200, Andrew Jones wrote:
> > > > Provide a method to change normal, cacheable memory to non-cacheable.
> > > > KVM will make use of this to keep emulated device memory regions
> > > > coherent with the guest.
> > > > 
> > > > Signed-off-by: Andrew Jones <drjones@redhat.com>
> > > 
> > > Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
> > > 
> > > But you obviously need Russell and Will/Catalin to ack/merge this.
> > 
> > I guess this patch is going to go away in the next round. You've
> > pointed out that I screwed stuff up royally with my over eagerness
> > to reuse code. I need to reimplement change_memory_common, but a
> > version that takes an mm, which is more or less what I did in the
> > last version of this series, back when I was pinning pages.
> 
> I kept wondering what this patch and the next one are doing with
> set_memory_nc(). Basically you were trying to set the cache attributes
> for the kernel linear mapping or kmap address (in the 32-bit arm case,
> which is removed anyway on kunmap).

Yeah, I was way to hasty with this version... Sorry for wasting people's
time with it.

> 
> What you need is changing the attributes of the user mapping as accessed
> by Qemu but I don't think simply re-implementing change_memory_common()
> would work, you probably need to pin the pages in memory as well.
> Otherwise, the kernel may remove such pages and, when bringing them
> back, would set the default cacheability attributes.

Yup, I read that code and saw it would inherit the memory attributes
from the vma. At the time I wasn't thinking about the userspace mapping
though, and thus had convinced myself that if "the" mapping goes away,
then we'll be invalidating the guest's mapping too, and thus we'd end up
faulting it in again when necessary, and at that time resetting the
attributes. If the guest never faulted it in again, then it wouldn't
have mattered what the attributes were anyway. Of course I was thinking
about the wrong mapping...

> 
> Another way would be to split the vma containing the non-cacheable
> memory so that you get a single vma with the vm_page_prot as
> Non-cacheable.

This sounds interesting. Actually, it even crossed my mind once when I
first saw that the vma would overwrite the attributes, but then, sigh,
I let my brain take a stupidity bath.

> 
> Yet another approach could be for KVM to mmap the necessary memory for
> Qemu via a file_operations.mmap call (but that's only for ranges outside
> the guest "RAM").

I guess I prefer the vma splitting, rather than this (the vma creating
with mmap), as it keeps the KVM interface from changing (as you point out
below). Well, unless there are other advantages to this that are worth
considering?

> 
> I didn't have time to follow these threads in details, but just to
> recap my understanding, we have two main use-cases:
> 
> 1. Qemu handling guest I/O to device (e.g. PCIe BARs)
> 2. Qemu emulating device DMA
> 
> For (1), I guess Qemu uses an anonymous mmap() and then tells KVM about
> this memory slot. The memory attributes in this case could be Device
> because that's how the guest would normally map it. The
> file_operations.mmap trick would work in this case but this means
> expanding the KVM ABI beyond just an ioctl().
> 
> For (2), since Qemu is writing to the guest "RAM" (e.g. video
> framebuffer allocated by the guest), I still think the simplest is to
> tell the guest (via DT) that such device is cache coherent rather than
> trying to remap the Qemu mapping as non-cacheable.

If we need a solution for (1), then I'd prefer that it work and be
applied to (2) as well. Anyway, I'm still not 100% sure we can count on
all guest types (booloaders, different OSes) to listen to us. They may
assume non-cacheable is typical and safe, and thus just do that always.
We can certainly change some of those bootloaders and OSes, but probably
not all of them.

Thanks,
drew
Catalin Marinas May 19, 2015, 11:18 a.m. UTC | #6
On Tue, May 19, 2015 at 11:03:22AM +0100, Andrew Jones wrote:
> On Mon, May 18, 2015 at 04:53:03PM +0100, Catalin Marinas wrote:
> > Another way would be to split the vma containing the non-cacheable
> > memory so that you get a single vma with the vm_page_prot as
> > Non-cacheable.
> 
> This sounds interesting. Actually, it even crossed my mind once when I
> first saw that the vma would overwrite the attributes, but then, sigh,
> I let my brain take a stupidity bath.
> 
> > 
> > Yet another approach could be for KVM to mmap the necessary memory for
> > Qemu via a file_operations.mmap call (but that's only for ranges outside
> > the guest "RAM").
> 
> I guess I prefer the vma splitting, rather than this (the vma creating
> with mmap), as it keeps the KVM interface from changing (as you point out
> below). Well, unless there are other advantages to this that are worth
> considering?

The advantage is that you don't need to deal with the mm internals in
the KVM code.

But you can probably add such code directly to mm/ and reuse some of the
existing code in there already as part of change_protection(),
mprotect_fixup(), sys_mprotect(). Actually, once you split the vma and
set the new protection (something similar to mprotect_fixup), it looks
to me like you can just call change_protection(vma->vm_page_prot).

> > I didn't have time to follow these threads in details, but just to
> > recap my understanding, we have two main use-cases:
> > 
> > 1. Qemu handling guest I/O to device (e.g. PCIe BARs)
> > 2. Qemu emulating device DMA
> > 
> > For (1), I guess Qemu uses an anonymous mmap() and then tells KVM about
> > this memory slot. The memory attributes in this case could be Device
> > because that's how the guest would normally map it. The
> > file_operations.mmap trick would work in this case but this means
> > expanding the KVM ABI beyond just an ioctl().
> > 
> > For (2), since Qemu is writing to the guest "RAM" (e.g. video
> > framebuffer allocated by the guest), I still think the simplest is to
> > tell the guest (via DT) that such device is cache coherent rather than
> > trying to remap the Qemu mapping as non-cacheable.
> 
> If we need a solution for (1), then I'd prefer that it work and be
> applied to (2) as well. Anyway, I'm still not 100% sure we can count on
> all guest types (booloaders, different OSes) to listen to us. They may
> assume non-cacheable is typical and safe, and thus just do that always.
> We can certainly change some of those bootloaders and OSes, but probably
> not all of them.

That's fine by me. Once you get the vma splitting and attributes
changing done, I think you get the second one for free.

Do we want to differentiate between Device and Normal Non-cacheable
memory? Something like KVM_MEMSLOT_DEVICE?

Nitpick: I'm not sure whether "uncached" is clear enough. In Linux,
pgprot_noncached() returns Strongly Ordered memory. For Normal
Non-cachable we used pgprot_writecombine (e.g. a video framebuffer).

Maybe something like KVM_MEMSLOT_COHERENT meaning a request to KVM to
ensure that guest and host access it coherently (which would mean
writecombine for ARM). That's similar naming to functions like
dma_alloc_coherent() that return cacheable or non-cacheable memory based
on what the device supports. Anyway, I'm not to bothered with the
naming.
Andrew Jones May 19, 2015, 11:38 a.m. UTC | #7
On Tue, May 19, 2015 at 12:18:54PM +0100, Catalin Marinas wrote:
> On Tue, May 19, 2015 at 11:03:22AM +0100, Andrew Jones wrote:
> > On Mon, May 18, 2015 at 04:53:03PM +0100, Catalin Marinas wrote:
> > > Another way would be to split the vma containing the non-cacheable
> > > memory so that you get a single vma with the vm_page_prot as
> > > Non-cacheable.
> > 
> > This sounds interesting. Actually, it even crossed my mind once when I
> > first saw that the vma would overwrite the attributes, but then, sigh,
> > I let my brain take a stupidity bath.
> > 
> > > 
> > > Yet another approach could be for KVM to mmap the necessary memory for
> > > Qemu via a file_operations.mmap call (but that's only for ranges outside
> > > the guest "RAM").
> > 
> > I guess I prefer the vma splitting, rather than this (the vma creating
> > with mmap), as it keeps the KVM interface from changing (as you point out
> > below). Well, unless there are other advantages to this that are worth
> > considering?
> 
> The advantage is that you don't need to deal with the mm internals in
> the KVM code.
> 
> But you can probably add such code directly to mm/ and reuse some of the
> existing code in there already as part of change_protection(),
> mprotect_fixup(), sys_mprotect(). Actually, once you split the vma and
> set the new protection (something similar to mprotect_fixup), it looks
> to me like you can just call change_protection(vma->vm_page_prot).

I'll start playing around with this today.

> 
> > > I didn't have time to follow these threads in details, but just to
> > > recap my understanding, we have two main use-cases:
> > > 
> > > 1. Qemu handling guest I/O to device (e.g. PCIe BARs)
> > > 2. Qemu emulating device DMA
> > > 
> > > For (1), I guess Qemu uses an anonymous mmap() and then tells KVM about
> > > this memory slot. The memory attributes in this case could be Device
> > > because that's how the guest would normally map it. The
> > > file_operations.mmap trick would work in this case but this means
> > > expanding the KVM ABI beyond just an ioctl().
> > > 
> > > For (2), since Qemu is writing to the guest "RAM" (e.g. video
> > > framebuffer allocated by the guest), I still think the simplest is to
> > > tell the guest (via DT) that such device is cache coherent rather than
> > > trying to remap the Qemu mapping as non-cacheable.
> > 
> > If we need a solution for (1), then I'd prefer that it work and be
> > applied to (2) as well. Anyway, I'm still not 100% sure we can count on
> > all guest types (booloaders, different OSes) to listen to us. They may
> > assume non-cacheable is typical and safe, and thus just do that always.
> > We can certainly change some of those bootloaders and OSes, but probably
> > not all of them.
> 
> That's fine by me. Once you get the vma splitting and attributes
> changing done, I think you get the second one for free.
> 
> Do we want to differentiate between Device and Normal Non-cacheable
> memory? Something like KVM_MEMSLOT_DEVICE?
> 
> Nitpick: I'm not sure whether "uncached" is clear enough. In Linux,
> pgprot_noncached() returns Strongly Ordered memory. For Normal
> Non-cachable we used pgprot_writecombine (e.g. a video framebuffer).
> 
> Maybe something like KVM_MEMSLOT_COHERENT meaning a request to KVM to

Sounds good to me. I'll rename for the next round.

> ensure that guest and host access it coherently (which would mean
> writecombine for ARM). That's similar naming to functions like
> dma_alloc_coherent() that return cacheable or non-cacheable memory based
> on what the device supports. Anyway, I'm not to bothered with the
> naming.
>

Thanks for your help!

drew
Christoffer Dall May 20, 2015, 10:01 a.m. UTC | #8
On Tue, May 19, 2015 at 12:18:54PM +0100, Catalin Marinas wrote:
> On Tue, May 19, 2015 at 11:03:22AM +0100, Andrew Jones wrote:
> > On Mon, May 18, 2015 at 04:53:03PM +0100, Catalin Marinas wrote:
> > > Another way would be to split the vma containing the non-cacheable
> > > memory so that you get a single vma with the vm_page_prot as
> > > Non-cacheable.
> > 
> > This sounds interesting. Actually, it even crossed my mind once when I
> > first saw that the vma would overwrite the attributes, but then, sigh,
> > I let my brain take a stupidity bath.
> > 
> > > 
> > > Yet another approach could be for KVM to mmap the necessary memory for
> > > Qemu via a file_operations.mmap call (but that's only for ranges outside
> > > the guest "RAM").
> > 
> > I guess I prefer the vma splitting, rather than this (the vma creating
> > with mmap), as it keeps the KVM interface from changing (as you point out
> > below). Well, unless there are other advantages to this that are worth
> > considering?
> 
> The advantage is that you don't need to deal with the mm internals in
> the KVM code.
> 
> But you can probably add such code directly to mm/ and reuse some of the
> existing code in there already as part of change_protection(),
> mprotect_fixup(), sys_mprotect(). Actually, once you split the vma and
> set the new protection (something similar to mprotect_fixup), it looks
> to me like you can just call change_protection(vma->vm_page_prot).
> 
> > > I didn't have time to follow these threads in details, but just to
> > > recap my understanding, we have two main use-cases:
> > > 
> > > 1. Qemu handling guest I/O to device (e.g. PCIe BARs)
> > > 2. Qemu emulating device DMA
> > > 
> > > For (1), I guess Qemu uses an anonymous mmap() and then tells KVM about
> > > this memory slot. The memory attributes in this case could be Device
> > > because that's how the guest would normally map it. The
> > > file_operations.mmap trick would work in this case but this means
> > > expanding the KVM ABI beyond just an ioctl().
> > > 
> > > For (2), since Qemu is writing to the guest "RAM" (e.g. video
> > > framebuffer allocated by the guest), I still think the simplest is to
> > > tell the guest (via DT) that such device is cache coherent rather than
> > > trying to remap the Qemu mapping as non-cacheable.
> > 
> > If we need a solution for (1), then I'd prefer that it work and be
> > applied to (2) as well. Anyway, I'm still not 100% sure we can count on
> > all guest types (booloaders, different OSes) to listen to us. They may
> > assume non-cacheable is typical and safe, and thus just do that always.
> > We can certainly change some of those bootloaders and OSes, but probably
> > not all of them.
> 
> That's fine by me. Once you get the vma splitting and attributes
> changing done, I think you get the second one for free.
> 
> Do we want to differentiate between Device and Normal Non-cacheable
> memory? Something like KVM_MEMSLOT_DEVICE?
> 
> Nitpick: I'm not sure whether "uncached" is clear enough. In Linux,
> pgprot_noncached() returns Strongly Ordered memory. For Normal
> Non-cachable we used pgprot_writecombine (e.g. a video framebuffer).
> 
> Maybe something like KVM_MEMSLOT_COHERENT meaning a request to KVM to
> ensure that guest and host access it coherently (which would mean
> writecombine for ARM). That's similar naming to functions like
> dma_alloc_coherent() that return cacheable or non-cacheable memory based
> on what the device supports. Anyway, I'm not to bothered with the
> naming.
> 
One thing to keep in mind for (2) is that QEMU is likely to do things
like calling regular memcpy() on the memory region, so mapping it as
device memory which would fault on unaligned accesses may be a problem,
so ideally there is a memory type for the user space mapping which
allows such behavior where we at the same time can guarantee the that
the mapping is coherent with the guest mapping through the S2
attributes.

-Christoffer
Catalin Marinas May 20, 2015, 11:24 a.m. UTC | #9
On Wed, May 20, 2015 at 11:01:27AM +0100, Christoffer Dall wrote:
> On Tue, May 19, 2015 at 12:18:54PM +0100, Catalin Marinas wrote:
> > On Tue, May 19, 2015 at 11:03:22AM +0100, Andrew Jones wrote:
> > > On Mon, May 18, 2015 at 04:53:03PM +0100, Catalin Marinas wrote:
> > > > I didn't have time to follow these threads in details, but just to
> > > > recap my understanding, we have two main use-cases:
> > > > 
> > > > 1. Qemu handling guest I/O to device (e.g. PCIe BARs)
> > > > 2. Qemu emulating device DMA
> > > > 
> > > > For (1), I guess Qemu uses an anonymous mmap() and then tells KVM about
> > > > this memory slot. The memory attributes in this case could be Device
> > > > because that's how the guest would normally map it. The
> > > > file_operations.mmap trick would work in this case but this means
> > > > expanding the KVM ABI beyond just an ioctl().
> > > > 
> > > > For (2), since Qemu is writing to the guest "RAM" (e.g. video
> > > > framebuffer allocated by the guest), I still think the simplest is to
> > > > tell the guest (via DT) that such device is cache coherent rather than
> > > > trying to remap the Qemu mapping as non-cacheable.
> > > 
> > > If we need a solution for (1), then I'd prefer that it work and be
> > > applied to (2) as well. Anyway, I'm still not 100% sure we can count on
> > > all guest types (booloaders, different OSes) to listen to us. They may
> > > assume non-cacheable is typical and safe, and thus just do that always.
> > > We can certainly change some of those bootloaders and OSes, but probably
> > > not all of them.
> > 
> > That's fine by me. Once you get the vma splitting and attributes
> > changing done, I think you get the second one for free.
> > 
> > Do we want to differentiate between Device and Normal Non-cacheable
> > memory? Something like KVM_MEMSLOT_DEVICE?
> > 
> > Nitpick: I'm not sure whether "uncached" is clear enough. In Linux,
> > pgprot_noncached() returns Strongly Ordered memory. For Normal
> > Non-cachable we used pgprot_writecombine (e.g. a video framebuffer).
> > 
> > Maybe something like KVM_MEMSLOT_COHERENT meaning a request to KVM to
> > ensure that guest and host access it coherently (which would mean
> > writecombine for ARM). That's similar naming to functions like
> > dma_alloc_coherent() that return cacheable or non-cacheable memory based
> > on what the device supports. Anyway, I'm not to bothered with the
> > naming.
> > 
> One thing to keep in mind for (2) is that QEMU is likely to do things
> like calling regular memcpy() on the memory region, so mapping it as
> device memory which would fault on unaligned accesses may be a problem,
> so ideally there is a memory type for the user space mapping which
> allows such behavior where we at the same time can guarantee the that
> the mapping is coherent with the guest mapping through the S2
> attributes.

I agree, for (2) we need Normal memory (either cacheable or
non-cacheable, though as I can see it's more likely the latter as we
can't guarantee the guest honouring "dma-coherent" device properties).
Mario Smarduch May 23, 2015, 1:08 a.m. UTC | #10
On 05/18/2015 08:53 AM, Catalin Marinas wrote:
> On Thu, May 14, 2015 at 02:46:44PM +0100, Andrew Jones wrote:
>> On Thu, May 14, 2015 at 01:05:09PM +0200, Christoffer Dall wrote:
>>> On Wed, May 13, 2015 at 01:31:52PM +0200, Andrew Jones wrote:
>>>> Provide a method to change normal, cacheable memory to non-cacheable.
>>>> KVM will make use of this to keep emulated device memory regions
>>>> coherent with the guest.
>>>>
>>>> Signed-off-by: Andrew Jones <drjones@redhat.com>
>>>
>>> Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
>>>
>>> But you obviously need Russell and Will/Catalin to ack/merge this.
>>
>> I guess this patch is going to go away in the next round. You've
>> pointed out that I screwed stuff up royally with my over eagerness
>> to reuse code. I need to reimplement change_memory_common, but a
>> version that takes an mm, which is more or less what I did in the
>> last version of this series, back when I was pinning pages.
> 
> I kept wondering what this patch and the next one are doing with
> set_memory_nc(). Basically you were trying to set the cache attributes
> for the kernel linear mapping or kmap address (in the 32-bit arm case,
> which is removed anyway on kunmap).
> 
> What you need is changing the attributes of the user mapping as accessed
> by Qemu but I don't think simply re-implementing change_memory_common()
> would work, you probably need to pin the pages in memory as well.
> Otherwise, the kernel may remove such pages and, when bringing them
> back, would set the default cacheability attributes.
> 
> Another way would be to split the vma containing the non-cacheable
> memory so that you get a single vma with the vm_page_prot as
> Non-cacheable.
> 
> Yet another approach could be for KVM to mmap the necessary memory for
> Qemu via a file_operations.mmap call (but that's only for ranges outside
> the guest "RAM").

I think this option with a basic loadable driver
that allocates non-cachable/pinned pages for QEMU to mmap()
may provide a reference point to build on. If that covers
all the cases then perhaps move to more generic solution. This
should be quicker to implement and test.

I wonder if kernel mm will ever have a reason
to create a cacheable mapping even if the pages are pinned?
Like reading /dev/mem although that's not a likely case here.

- Mario

> 
> I didn't have time to follow these threads in details, but just to
> recap my understanding, we have two main use-cases:
> 
> 1. Qemu handling guest I/O to device (e.g. PCIe BARs)
> 2. Qemu emulating device DMA
> 
> For (1), I guess Qemu uses an anonymous mmap() and then tells KVM about
> this memory slot. The memory attributes in this case could be Device
> because that's how the guest would normally map it. The
> file_operations.mmap trick would work in this case but this means
> expanding the KVM ABI beyond just an ioctl().
> 
> For (2), since Qemu is writing to the guest "RAM" (e.g. video
> framebuffer allocated by the guest), I still think the simplest is to
> tell the guest (via DT) that such device is cache coherent rather than
> trying to remap the Qemu mapping as non-cacheable.
>
Andrew Jones May 25, 2015, 5:11 p.m. UTC | #11
On Fri, May 22, 2015 at 06:08:30PM -0700, Mario Smarduch wrote:
> On 05/18/2015 08:53 AM, Catalin Marinas wrote:
> > On Thu, May 14, 2015 at 02:46:44PM +0100, Andrew Jones wrote:
> >> On Thu, May 14, 2015 at 01:05:09PM +0200, Christoffer Dall wrote:
> >>> On Wed, May 13, 2015 at 01:31:52PM +0200, Andrew Jones wrote:
> >>>> Provide a method to change normal, cacheable memory to non-cacheable.
> >>>> KVM will make use of this to keep emulated device memory regions
> >>>> coherent with the guest.
> >>>>
> >>>> Signed-off-by: Andrew Jones <drjones@redhat.com>
> >>>
> >>> Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
> >>>
> >>> But you obviously need Russell and Will/Catalin to ack/merge this.
> >>
> >> I guess this patch is going to go away in the next round. You've
> >> pointed out that I screwed stuff up royally with my over eagerness
> >> to reuse code. I need to reimplement change_memory_common, but a
> >> version that takes an mm, which is more or less what I did in the
> >> last version of this series, back when I was pinning pages.
> > 
> > I kept wondering what this patch and the next one are doing with
> > set_memory_nc(). Basically you were trying to set the cache attributes
> > for the kernel linear mapping or kmap address (in the 32-bit arm case,
> > which is removed anyway on kunmap).
> > 
> > What you need is changing the attributes of the user mapping as accessed
> > by Qemu but I don't think simply re-implementing change_memory_common()
> > would work, you probably need to pin the pages in memory as well.
> > Otherwise, the kernel may remove such pages and, when bringing them
> > back, would set the default cacheability attributes.
> > 
> > Another way would be to split the vma containing the non-cacheable
> > memory so that you get a single vma with the vm_page_prot as
> > Non-cacheable.
> > 
> > Yet another approach could be for KVM to mmap the necessary memory for
> > Qemu via a file_operations.mmap call (but that's only for ranges outside
> > the guest "RAM").
> 
> I think this option with a basic loadable driver
> that allocates non-cachable/pinned pages for QEMU to mmap()
> may provide a reference point to build on. If that covers
> all the cases then perhaps move to more generic solution. This
> should be quicker to implement and test.

I've pulled together a different approach for experimentation. I added a
new mmap/mprotect prot flag, like what was done for the powerpc SAO bit
(see commit b845f313d78e4e "mm: Allow architectures to define additional
protection bits" and commit ef3d3246a0d06 " powerpc/mm: Add Strong Access
Ordering support"). So far I've only tested with a simple test program that
forks and messes around with mapping shared memory in different ways. With
some cache flushing added to set_pte_at on memattr changes, then it seems
to work pretty well.

Now I'll start experimenting with QEMU again to see if the "map QEMU's
memory as noncacheable" approach makes any sense at all. If it does,
then I'm not sure we want to do it with mprotect, but I could clean up
the patches and post them for discussion. The main problems I see with it
are the need to define a new PROT_ flag, and the fact that it might not
be a good idea for userspace to have this capability in the first place,
at least not for MAP_SHARED regions.

> 
> I wonder if kernel mm will ever have a reason
> to create a cacheable mapping even if the pages are pinned?
> Like reading /dev/mem although that's not a likely case here.

Actually for a version with pinned pages, then I think this one
http://www.spinics.net/lists/kvm-arm/msg14021.html
is a good start. There are some problems with it, but I can fix
them in order for it to be useful for experimenting with QEMU. It
suffers from the Device-nGnRnE issue Peter and Alex pointed out, and
I also see that I'm using the wrong address in there as well for the
dcache flush, because I'm using kvm_flush_dcache_pte(*ptep) in
set_page_uncached, which ends up using page_address(page). I should
just do __flush_dcache_area(addr, PAGE_SIZE), instead.

Thanks,
drew
Mario Smarduch May 27, 2015, 1:08 a.m. UTC | #12
On 05/25/2015 10:11 AM, Andrew Jones wrote:
> On Fri, May 22, 2015 at 06:08:30PM -0700, Mario Smarduch wrote:
>> On 05/18/2015 08:53 AM, Catalin Marinas wrote:
>>> On Thu, May 14, 2015 at 02:46:44PM +0100, Andrew Jones wrote:
>>>> On Thu, May 14, 2015 at 01:05:09PM +0200, Christoffer Dall wrote:
>>>>> On Wed, May 13, 2015 at 01:31:52PM +0200, Andrew Jones wrote:
>>>>>> Provide a method to change normal, cacheable memory to non-cacheable.
>>>>>> KVM will make use of this to keep emulated device memory regions
>>>>>> coherent with the guest.
>>>>>>
>>>>>> Signed-off-by: Andrew Jones <drjones@redhat.com>
>>>>>
>>>>> Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
>>>>>
>>>>> But you obviously need Russell and Will/Catalin to ack/merge this.
>>>>
>>>> I guess this patch is going to go away in the next round. You've
>>>> pointed out that I screwed stuff up royally with my over eagerness
>>>> to reuse code. I need to reimplement change_memory_common, but a
>>>> version that takes an mm, which is more or less what I did in the
>>>> last version of this series, back when I was pinning pages.
>>>
>>> I kept wondering what this patch and the next one are doing with
>>> set_memory_nc(). Basically you were trying to set the cache attributes
>>> for the kernel linear mapping or kmap address (in the 32-bit arm case,
>>> which is removed anyway on kunmap).
>>>
>>> What you need is changing the attributes of the user mapping as accessed
>>> by Qemu but I don't think simply re-implementing change_memory_common()
>>> would work, you probably need to pin the pages in memory as well.
>>> Otherwise, the kernel may remove such pages and, when bringing them
>>> back, would set the default cacheability attributes.
>>>
>>> Another way would be to split the vma containing the non-cacheable
>>> memory so that you get a single vma with the vm_page_prot as
>>> Non-cacheable.
>>>
>>> Yet another approach could be for KVM to mmap the necessary memory for
>>> Qemu via a file_operations.mmap call (but that's only for ranges outside
>>> the guest "RAM").
>>
>> I think this option with a basic loadable driver
>> that allocates non-cachable/pinned pages for QEMU to mmap()
>> may provide a reference point to build on. If that covers
>> all the cases then perhaps move to more generic solution. This
>> should be quicker to implement and test.
> 
> I've pulled together a different approach for experimentation. I added a
> new mmap/mprotect prot flag, like what was done for the powerpc SAO bit
> (see commit b845f313d78e4e "mm: Allow architectures to define additional
> protection bits" and commit ef3d3246a0d06 " powerpc/mm: Add Strong Access
> Ordering support"). So far I've only tested with a simple test program that
> forks and messes around with mapping shared memory in different ways. With
> some cache flushing added to set_pte_at on memattr changes, then it seems
> to work pretty well.

Thanks for the pointer, it's pretty deep into generic mmap_region()
code. Does SAO apply to regular memory or MMIO regions?

> 
> Now I'll start experimenting with QEMU again to see if the "map QEMU's
> memory as noncacheable" approach makes any sense at all. If it does,
> then I'm not sure we want to do it with mprotect, but I could clean up
> the patches and post them for discussion. The main problems I see with it
> are the need to define a new PROT_ flag, and the fact that it might not
> be a good idea for userspace to have this capability in the first place,
> at least not for MAP_SHARED regions.

Seem like pretty significant incisions to generic kernel.
Appears like below patches do same thing without adding
arch specific vma protection extensions. You would need
to lock the region pages, right? Also flush the TLBs
after locking. Appears to make a general mmap() interface
unique for this case.

> 
>>
>> I wonder if kernel mm will ever have a reason
>> to create a cacheable mapping even if the pages are pinned?
>> Like reading /dev/mem although that's not a likely case here.
> 
> Actually for a version with pinned pages, then I think this one
> http://www.spinics.net/lists/kvm-arm/msg14021.html
> is a good start. There are some problems with it, but I can fix
> them in order for it to be useful for experimenting with QEMU. It
> suffers from the Device-nGnRnE issue Peter and Alex pointed out, and
> I also see that I'm using the wrong address in there as well for the
> dcache flush, because I'm using kvm_flush_dcache_pte(*ptep) in
> set_page_uncached, which ends up using page_address(page). I should
> just do __flush_dcache_area(addr, PAGE_SIZE), instead.

Sorry I missed this series it appears clean, with no excessive flushing,
and the few additional modifications you pointed out. Instead of
flush_tlb_kernel_range(), maybe use flush_tlb_all(),
I'm not sure vaae1is flushes any other alias mappings to
same pfn.

Thanks,
- Mario

> 
> Thanks,
> drew
>
diff mbox

Patch

diff --git a/arch/arm/include/asm/cacheflush.h b/arch/arm/include/asm/cacheflush.h
index 2d46862e7bef7..682a8b13d6019 100644
--- a/arch/arm/include/asm/cacheflush.h
+++ b/arch/arm/include/asm/cacheflush.h
@@ -486,6 +486,7 @@  int set_memory_ro(unsigned long addr, int numpages);
 int set_memory_rw(unsigned long addr, int numpages);
 int set_memory_x(unsigned long addr, int numpages);
 int set_memory_nx(unsigned long addr, int numpages);
+int set_memory_nc(unsigned long addr, int numpages);
 
 #ifdef CONFIG_DEBUG_RODATA
 void mark_rodata_ro(void);
diff --git a/arch/arm/mm/pageattr.c b/arch/arm/mm/pageattr.c
index cf30daff89325..9f9f752cab871 100644
--- a/arch/arm/mm/pageattr.c
+++ b/arch/arm/mm/pageattr.c
@@ -92,3 +92,10 @@  int set_memory_x(unsigned long addr, int numpages)
 					__pgprot(0),
 					__pgprot(L_PTE_XN));
 }
+
+int set_memory_nc(unsigned long addr, int numpages)
+{
+	return change_memory_common(addr, numpages,
+					__pgprot(L_PTE_MT_BUFFERABLE),
+					__pgprot(L_PTE_MT_MASK));
+}
diff --git a/arch/arm64/include/asm/cacheflush.h b/arch/arm64/include/asm/cacheflush.h
index 67d309cc3b6b8..ef671f38c19ad 100644
--- a/arch/arm64/include/asm/cacheflush.h
+++ b/arch/arm64/include/asm/cacheflush.h
@@ -152,6 +152,7 @@  int set_memory_ro(unsigned long addr, int numpages);
 int set_memory_rw(unsigned long addr, int numpages);
 int set_memory_x(unsigned long addr, int numpages);
 int set_memory_nx(unsigned long addr, int numpages);
+int set_memory_nc(unsigned long addr, int numpages);
 
 #ifdef CONFIG_DEBUG_RODATA
 void mark_rodata_ro(void);
diff --git a/arch/arm64/mm/pageattr.c b/arch/arm64/mm/pageattr.c
index e47ed1c5dce1b..c837adcf26fc6 100644
--- a/arch/arm64/mm/pageattr.c
+++ b/arch/arm64/mm/pageattr.c
@@ -96,3 +96,11 @@  int set_memory_x(unsigned long addr, int numpages)
 					__pgprot(PTE_PXN));
 }
 EXPORT_SYMBOL_GPL(set_memory_x);
+
+int set_memory_nc(unsigned long addr, int numpages)
+{
+	return change_memory_common(addr, numpages,
+					__pgprot(PTE_ATTRINDX(MT_NORMAL_NC)),
+					__pgprot(PTE_ATTRINDX_MASK));
+}
+EXPORT_SYMBOL_GPL(set_memory_nc);