[18/19] KVM: PPC: Book3S HV: add passthrough support
diff mbox series

Message ID 20190107191006.10648-2-clg@kaod.org
State Changes Requested
Headers show
Series
  • KVM: PPC: Book3S HV: add XIVE native exploitation mode
Related show

Commit Message

Cédric Le Goater Jan. 7, 2019, 7:10 p.m. UTC
Clear the ESB pages from the VMA of the IRQ being pass through to the
guest and let the fault handler repopulate the VMA when the ESB pages
are accessed for an EOI or for a trigger.

Storing the VMA under the KVM XIVE device is a little ugly.

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---
 arch/powerpc/kvm/book3s_xive.h        |  8 +++++++
 arch/powerpc/kvm/book3s_xive.c        | 15 ++++++++++++++
 arch/powerpc/kvm/book3s_xive_native.c | 30 +++++++++++++++++++++++++++
 3 files changed, 53 insertions(+)

Comments

Paul Mackerras Jan. 22, 2019, 5:26 a.m. UTC | #1
On Mon, Jan 07, 2019 at 08:10:05PM +0100, Cédric Le Goater wrote:
> Clear the ESB pages from the VMA of the IRQ being pass through to the
> guest and let the fault handler repopulate the VMA when the ESB pages
> are accessed for an EOI or for a trigger.

Why do we want to do this?

I don't see any possible advantage to removing the PTEs from the
userspace mapping.  You'll need to explain further.

Paul.
Benjamin Herrenschmidt Jan. 23, 2019, 6:45 a.m. UTC | #2
On Tue, 2019-01-22 at 16:26 +1100, Paul Mackerras wrote:
> On Mon, Jan 07, 2019 at 08:10:05PM +0100, Cédric Le Goater wrote:
> > Clear the ESB pages from the VMA of the IRQ being pass through to the
> > guest and let the fault handler repopulate the VMA when the ESB pages
> > are accessed for an EOI or for a trigger.
> 
> Why do we want to do this?
> 
> I don't see any possible advantage to removing the PTEs from the
> userspace mapping.  You'll need to explain further.

Afaik bcs we change the mapping to point to the real HW irq ESB page
instead of the "IPI" that was there at VM init time.

Cedric, is that correct ?

Cheers,
Ben.
Paul Mackerras Jan. 23, 2019, 10:30 a.m. UTC | #3
On Wed, Jan 23, 2019 at 05:45:24PM +1100, Benjamin Herrenschmidt wrote:
> On Tue, 2019-01-22 at 16:26 +1100, Paul Mackerras wrote:
> > On Mon, Jan 07, 2019 at 08:10:05PM +0100, Cédric Le Goater wrote:
> > > Clear the ESB pages from the VMA of the IRQ being pass through to the
> > > guest and let the fault handler repopulate the VMA when the ESB pages
> > > are accessed for an EOI or for a trigger.
> > 
> > Why do we want to do this?
> > 
> > I don't see any possible advantage to removing the PTEs from the
> > userspace mapping.  You'll need to explain further.
> 
> Afaik bcs we change the mapping to point to the real HW irq ESB page
> instead of the "IPI" that was there at VM init time.

So that makes it sound like there is a whole lot going on that hasn't
even been hinted at in the patch descriptions...  It sounds like we
need a good description of how all this works and fits together
somewhere under Documentation/.

In any case we need much more informative patch descriptions.  I
realize that it's all currently in Cedric's head, but I bet that in
two or three years' time when we come to try to debug something, it
won't be in anyone's head...

Paul.
Cédric Le Goater Jan. 23, 2019, 11:07 a.m. UTC | #4
On 1/23/19 11:30 AM, Paul Mackerras wrote:
> On Wed, Jan 23, 2019 at 05:45:24PM +1100, Benjamin Herrenschmidt wrote:
>> On Tue, 2019-01-22 at 16:26 +1100, Paul Mackerras wrote:
>>> On Mon, Jan 07, 2019 at 08:10:05PM +0100, Cédric Le Goater wrote:
>>>> Clear the ESB pages from the VMA of the IRQ being pass through to the
>>>> guest and let the fault handler repopulate the VMA when the ESB pages
>>>> are accessed for an EOI or for a trigger.
>>>
>>> Why do we want to do this?
>>>
>>> I don't see any possible advantage to removing the PTEs from the
>>> userspace mapping.  You'll need to explain further.
>>
>> Afaik bcs we change the mapping to point to the real HW irq ESB page
>> instead of the "IPI" that was there at VM init time.

yes exactly. You need to clean up the pages each time.
 
> So that makes it sound like there is a whole lot going on that hasn't
> even been hinted at in the patch descriptions...  It sounds like we
> need a good description of how all this works and fits together
> somewhere under Documentation/.

OK. I have started doing so for the models merged in QEMU but not yet 
for KVM. I will work on it.

> In any case we need much more informative patch descriptions.  I
> realize that it's all currently in Cedric's head, but I bet that in
> two or three years' time when we come to try to debug something, it
> won't be in anyone's head...

I agree. 


So, storing the ESB VMA under the KVM device is not shocking anyone ?  

Thanks,

C.
Benjamin Herrenschmidt Jan. 23, 2019, 9:25 p.m. UTC | #5
On Wed, 2019-01-23 at 21:30 +1100, Paul Mackerras wrote:
> > Afaik bcs we change the mapping to point to the real HW irq ESB page
> > instead of the "IPI" that was there at VM init time.
> 
> So that makes it sound like there is a whole lot going on that hasn't
> even been hinted at in the patch descriptions...  It sounds like we
> need a good description of how all this works and fits together
> somewhere under Documentation/.
> 
> In any case we need much more informative patch descriptions.  I
> realize that it's all currently in Cedric's head, but I bet that in
> two or three years' time when we come to try to debug something, it
> won't be in anyone's head...

The main problem is understanding XIVE itself. It's not realistic to
ask Cedric to write a proper documentation for XIVE as part of the
patch series, but sadly IBM doesn't have a good one to provide either.

Ben.
Cédric Le Goater Jan. 24, 2019, 8:41 a.m. UTC | #6
On 1/23/19 10:25 PM, Benjamin Herrenschmidt wrote:
> On Wed, 2019-01-23 at 21:30 +1100, Paul Mackerras wrote:
>>> Afaik bcs we change the mapping to point to the real HW irq ESB page
>>> instead of the "IPI" that was there at VM init time.
>>
>> So that makes it sound like there is a whole lot going on that hasn't
>> even been hinted at in the patch descriptions...  It sounds like we
>> need a good description of how all this works and fits together
>> somewhere under Documentation/.
>>
>> In any case we need much more informative patch descriptions.  I
>> realize that it's all currently in Cedric's head, but I bet that in
>> two or three years' time when we come to try to debug something, it
>> won't be in anyone's head...
> 
> The main problem is understanding XIVE itself. It's not realistic to
> ask Cedric to write a proper documentation for XIVE as part of the
> patch series, but sadly IBM doesn't have a good one to provide either.

QEMU has a preliminary introduction we could use :

https://git.qemu.org/?p=qemu.git;a=blob;f=include/hw/ppc/xive.h;h=ec23253ba448e25c621356b55a7777119a738f8e;hb=HEAD

With some extensions for sPAPR and KVM, the resulting file could 
be moved to the Linux documentation directory. This would be an
iterative process over time of course. 

Cheers,

C.
Paul Mackerras Jan. 28, 2019, 4:43 a.m. UTC | #7
On Thu, Jan 24, 2019 at 08:25:15AM +1100, Benjamin Herrenschmidt wrote:
> On Wed, 2019-01-23 at 21:30 +1100, Paul Mackerras wrote:
> > > Afaik bcs we change the mapping to point to the real HW irq ESB page
> > > instead of the "IPI" that was there at VM init time.
> > 
> > So that makes it sound like there is a whole lot going on that hasn't
> > even been hinted at in the patch descriptions...  It sounds like we
> > need a good description of how all this works and fits together
> > somewhere under Documentation/.
> > 
> > In any case we need much more informative patch descriptions.  I
> > realize that it's all currently in Cedric's head, but I bet that in
> > two or three years' time when we come to try to debug something, it
> > won't be in anyone's head...
> 
> The main problem is understanding XIVE itself. It's not realistic to
> ask Cedric to write a proper documentation for XIVE as part of the
> patch series, but sadly IBM doesn't have a good one to provide either.

There are: (a) the XIVE hardware, (b) the definition of the XIVE
hypercalls that guests use, and (c) the design decisions around how to
implement that hypercall interface.  We need to get (b) published
somehow, but it is mostly (c) that I would expect the patch
descriptions to explain.

It sounds like there will be a mapping to userspace where the pages
can sometimes point to an IPI page and sometimes point to a real HW
irq ESB page.  That is, the same guest "hardware" irq number sometimes
refers to a software-generated interrupt (what you called an "IPI"
above) and sometimes to a hardware-generated interrupt.  That fact,
the reason why it is so and the consequences all need to be explained
somewhere.  They are really not obvious and I don't believe they are
part of either the XIVE hardware spec or the XIVE hypercall spec.

Paul.
Paul Mackerras Jan. 28, 2019, 6:13 a.m. UTC | #8
On Wed, Jan 23, 2019 at 12:07:19PM +0100, Cédric Le Goater wrote:
> On 1/23/19 11:30 AM, Paul Mackerras wrote:
> > On Wed, Jan 23, 2019 at 05:45:24PM +1100, Benjamin Herrenschmidt wrote:
> >> On Tue, 2019-01-22 at 16:26 +1100, Paul Mackerras wrote:
> >>> On Mon, Jan 07, 2019 at 08:10:05PM +0100, Cédric Le Goater wrote:
> >>>> Clear the ESB pages from the VMA of the IRQ being pass through to the
> >>>> guest and let the fault handler repopulate the VMA when the ESB pages
> >>>> are accessed for an EOI or for a trigger.
> >>>
> >>> Why do we want to do this?
> >>>
> >>> I don't see any possible advantage to removing the PTEs from the
> >>> userspace mapping.  You'll need to explain further.
> >>
> >> Afaik bcs we change the mapping to point to the real HW irq ESB page
> >> instead of the "IPI" that was there at VM init time.
> 
> yes exactly. You need to clean up the pages each time.
>  
> > So that makes it sound like there is a whole lot going on that hasn't
> > even been hinted at in the patch descriptions...  It sounds like we
> > need a good description of how all this works and fits together
> > somewhere under Documentation/.
> 
> OK. I have started doing so for the models merged in QEMU but not yet 
> for KVM. I will work on it.
> 
> > In any case we need much more informative patch descriptions.  I
> > realize that it's all currently in Cedric's head, but I bet that in
> > two or three years' time when we come to try to debug something, it
> > won't be in anyone's head...
> 
> I agree. 
> 
> 
> So, storing the ESB VMA under the KVM device is not shocking anyone ?  

Actually, now that I think of it, why can't userspace (QEMU) manage
this using mmap()?  Based on what Ben has said, I assume there would
be a pair of pages for each interrupt that a PCI pass-through device
has.  Would we end up with too many VMAs if we just used mmap() to
change the mappings from the software-generated pages to the
hardware-generated interrupt pages?  Are the necessary pages for a PCI
passthrough device contiguous in both host real space and guest real
space?  If so we'd only need one mmap() for all the device's interrupt
pages.

Paul.
Cédric Le Goater Jan. 28, 2019, 6:26 p.m. UTC | #9
On 1/28/19 7:13 AM, Paul Mackerras wrote:
> On Wed, Jan 23, 2019 at 12:07:19PM +0100, Cédric Le Goater wrote:
>> On 1/23/19 11:30 AM, Paul Mackerras wrote:
>>> On Wed, Jan 23, 2019 at 05:45:24PM +1100, Benjamin Herrenschmidt wrote:
>>>> On Tue, 2019-01-22 at 16:26 +1100, Paul Mackerras wrote:
>>>>> On Mon, Jan 07, 2019 at 08:10:05PM +0100, Cédric Le Goater wrote:
>>>>>> Clear the ESB pages from the VMA of the IRQ being pass through to the
>>>>>> guest and let the fault handler repopulate the VMA when the ESB pages
>>>>>> are accessed for an EOI or for a trigger.
>>>>>
>>>>> Why do we want to do this?
>>>>>
>>>>> I don't see any possible advantage to removing the PTEs from the
>>>>> userspace mapping.  You'll need to explain further.
>>>>
>>>> Afaik bcs we change the mapping to point to the real HW irq ESB page
>>>> instead of the "IPI" that was there at VM init time.
>>
>> yes exactly. You need to clean up the pages each time.
>>  
>>> So that makes it sound like there is a whole lot going on that hasn't
>>> even been hinted at in the patch descriptions...  It sounds like we
>>> need a good description of how all this works and fits together
>>> somewhere under Documentation/.
>>
>> OK. I have started doing so for the models merged in QEMU but not yet 
>> for KVM. I will work on it.
>>
>>> In any case we need much more informative patch descriptions.  I
>>> realize that it's all currently in Cedric's head, but I bet that in
>>> two or three years' time when we come to try to debug something, it
>>> won't be in anyone's head...
>>
>> I agree. 
>>
>>
>> So, storing the ESB VMA under the KVM device is not shocking anyone ?  
> 
> Actually, now that I think of it, why can't userspace (QEMU) manage
> this using mmap()?  Based on what Ben has said, I assume there would
> be a pair of pages for each interrupt that a PCI pass-through device
> has. 

Yes. there is a pair of ESB pages per IRQ number.

> Would we end up with too many VMAs if we just used mmap() to
> change the mappings from the software-generated pages to the
> hardware-generated interrupt pages?  
The sPAPR IRQ number space is 0x8000 wide now. The first 4K are 
dedicated to CPU IPIs and the remaining 4K are for devices. We can 
extend the last range if needed as these are for MSIs. Dynamic 
extensions under KVM should work also.

This to say that we have with 8K x 2 (trigger+EOI) pages. This is a
lot of mmap(), too much. Also the KVM model needs to be compatible
with the QEMU emulated one and it was simpler to have one overall
memory region for the IPI ESBs, one for the END ESBs (if we support
that one day) and one for the TIMA.

> Are the necessary pages for a PCI
> passthrough device contiguous in both host real space 

They should as they are the PHB4 ESBs.

> and guest real space ? 

also. That's how we organized the mapping. 

> If so we'd only need one mmap() for all the device's interrupt
> pages.

Ah. So we would want to make a special case for the passthrough 
device and have a mmap() and a memory region for its ESBs. Hmm.

Wouldn't that require to hot plug a memory region under the guest ? 
which means that we need to provision an address space/container 
region for theses regions. What are the benefits ? 

Is clearing the PTEs and repopulating the VMA unsafe ? 

Thanks,     

C.
Paul Mackerras Jan. 29, 2019, 2:45 a.m. UTC | #10
On Mon, Jan 28, 2019 at 07:26:00PM +0100, Cédric Le Goater wrote:
> On 1/28/19 7:13 AM, Paul Mackerras wrote:
> > Would we end up with too many VMAs if we just used mmap() to
> > change the mappings from the software-generated pages to the
> > hardware-generated interrupt pages?  
> The sPAPR IRQ number space is 0x8000 wide now. The first 4K are 
> dedicated to CPU IPIs and the remaining 4K are for devices. We can 

Confused.  You say the number space has 32768 entries but then imply
there are only 8K entries.  Do you mean that the API allows for 15-bit
IRQ numbers but we are only making using of 8192 of them?

> extend the last range if needed as these are for MSIs. Dynamic 
> extensions under KVM should work also.
> 
> This to say that we have with 8K x 2 (trigger+EOI) pages. This is a
> lot of mmap(), too much. Also the KVM model needs to be compatible

I wasn't suggesting an mmap per IRQ, I meant that the bulk of the
space would be covered by a single mmap, overlaid by subsequent mmaps
where we need to map real device interrupts.

> with the QEMU emulated one and it was simpler to have one overall
> memory region for the IPI ESBs, one for the END ESBs (if we support
> that one day) and one for the TIMA.
> 
> > Are the necessary pages for a PCI
> > passthrough device contiguous in both host real space 
> 
> They should as they are the PHB4 ESBs.
> 
> > and guest real space ? 
> 
> also. That's how we organized the mapping. 

"How we organized the mapping" is a significant design decision that I
haven't seen documented anywhere, and is really needed for
understanding what's going on.

> 
> > If so we'd only need one mmap() for all the device's interrupt
> > pages.
> 
> Ah. So we would want to make a special case for the passthrough 
> device and have a mmap() and a memory region for its ESBs. Hmm.
> 
> Wouldn't that require to hot plug a memory region under the guest ? 

No; the way that a memory region works is that userspace can do
whatever disparate mappings it likes within the region on the user
process side, and the corresponding region of guest real address space
follows that automatically.

> which means that we need to provision an address space/container 
> region for theses regions. What are the benefits ? 
> 
> Is clearing the PTEs and repopulating the VMA unsafe ? 

Explicitly unmapping parts of the VMA seems like the wrong way to do
it.  If you have a device mmap where the device wants to change the
physical page underlying parts of the mapping, there should be a way
for it to do that explicitly (but I don't know off the top of my head
what the interface to do that is).

However, I still haven't seen a clear and concise explanation of what
is being changed, when, and why we need to do that.

Paul.
Paul Mackerras Jan. 29, 2019, 4:12 a.m. UTC | #11
On Mon, Jan 28, 2019 at 07:26:00PM +0100, Cédric Le Goater wrote:
> 
> Is clearing the PTEs and repopulating the VMA unsafe ? 

Actually, now that I come to think of it, there could be any number of
VMAs (well, up to almost 64k of them), since once you have a file
descriptor you can call mmap on it multiple times.

The more I think about it, the more I think that getting userspace to
manage the mappings with mmap() and munmap() is the right idea if it
can be made to work.

Paul.
Cédric Le Goater Jan. 29, 2019, 1:46 p.m. UTC | #12
On 1/28/19 5:43 AM, Paul Mackerras wrote:
> On Thu, Jan 24, 2019 at 08:25:15AM +1100, Benjamin Herrenschmidt wrote:
>> On Wed, 2019-01-23 at 21:30 +1100, Paul Mackerras wrote:
>>>> Afaik bcs we change the mapping to point to the real HW irq ESB page
>>>> instead of the "IPI" that was there at VM init time.
>>>
>>> So that makes it sound like there is a whole lot going on that hasn't
>>> even been hinted at in the patch descriptions...  It sounds like we
>>> need a good description of how all this works and fits together
>>> somewhere under Documentation/.
>>>
>>> In any case we need much more informative patch descriptions.  I
>>> realize that it's all currently in Cedric's head, but I bet that in
>>> two or three years' time when we come to try to debug something, it
>>> won't be in anyone's head...
>>
>> The main problem is understanding XIVE itself. It's not realistic to
>> ask Cedric to write a proper documentation for XIVE as part of the
>> patch series, but sadly IBM doesn't have a good one to provide either.
> 
> There are: (a) the XIVE hardware, (b) the definition of the XIVE
> hypercalls that guests use, and (c) the design decisions around how to
> implement that hypercall interface.  We need to get (b) published
> somehow, but it is mostly (c) that I would expect the patch
> descriptions to explain.
> 
> It sounds like there will be a mapping to userspace where the pages
> can sometimes point to an IPI page and sometimes point to a real HW
> irq ESB page.  

Just to be clear. In both cases, these pages are real HW ESB pages. 
They are just attached to a different controller : the XIVE IC for 
the IPIs and the PHB4 for the others. 

> That is, the same guest "hardware" irq number sometimes
> refers to a software-generated interrupt (what you called an "IPI"
> above) and sometimes to a hardware-generated interrupt.  That fact,> the reason why it is so and the consequences all need to be explained
> somewhere.  They are really not obvious and I don't believe they are
> part of either the XIVE hardware spec or the XIVE hypercall spec.

I tried to put the reasons behind the current approach in another 
thread, not saying this is the correct one.

Thanks,

C.
Cédric Le Goater Jan. 29, 2019, 1:47 p.m. UTC | #13
On 1/29/19 3:45 AM, Paul Mackerras wrote:
> On Mon, Jan 28, 2019 at 07:26:00PM +0100, Cédric Le Goater wrote:
>> On 1/28/19 7:13 AM, Paul Mackerras wrote:
>>> Would we end up with too many VMAs if we just used mmap() to
>>> change the mappings from the software-generated pages to the
>>> hardware-generated interrupt pages?  
>> The sPAPR IRQ number space is 0x8000 wide now. The first 4K are 
>> dedicated to CPU IPIs and the remaining 4K are for devices. We can 
> 
> Confused.  You say the number space has 32768 entries but then imply
> there are only 8K entries.  Do you mean that the API allows for 15-bit
> IRQ numbers but we are only making using of 8192 of them?

Ouch. My bad. Let's do it again. 

The sPAPR IRQ number space is 0x2000 wide :

https://git.qemu.org/?p=qemu.git;a=blob;f=hw/ppc/spapr_irq.c;h=1da7a32348fced0bd638717022fc37a83fc5e279;hb=HEAD#l396

The first 4K are dedicated to the CPU IPIs and the remaining 4K are for 
devices (which can be extended if needed).

So that's 8192 x 2 ESB pages.

>> extend the last range if needed as these are for MSIs. Dynamic 
>> extensions under KVM should work also.
>>
>> This to say that we have with 8K x 2 (trigger+EOI) pages. This is a
>> lot of mmap(), too much. Also the KVM model needs to be compatible
> 
> I wasn't suggesting an mmap per IRQ, I meant that the bulk of the
> space would be covered by a single mmap, overlaid by subsequent mmaps
> where we need to map real device interrupts.

ok. The same fault handler could be used to populate the VMA with the 
ESB pages. 

But it would mean extra work on the QEMU side, which is not needed 
with this patch. 

>> with the QEMU emulated one and it was simpler to have one overall
>> memory region for the IPI ESBs, one for the END ESBs (if we support
>> that one day) and one for the TIMA.
>>
>>> Are the necessary pages for a PCI
>>> passthrough device contiguous in both host real space 
>>
>> They should as they are the PHB4 ESBs.
>>
>>> and guest real space ? 
>>
>> also. That's how we organized the mapping. 
> 
> "How we organized the mapping" is a significant design decision that I
> haven't seen documented anywhere, and is really needed for
> understanding what's going on.

OK. I will add comments on that. See below for some description.

There is nothing fancy, it's simply indexed with the interrupt number,
like for HW, or for the QEMU XIVE emulated model.

>>> If so we'd only need one mmap() for all the device's interrupt
>>> pages.
>>
>> Ah. So we would want to make a special case for the passthrough 
>> device and have a mmap() and a memory region for its ESBs. Hmm.
>>
>> Wouldn't that require to hot plug a memory region under the guest ? 
> 
> No; the way that a memory region works is that userspace can do
> whatever disparate mappings it likes within the region on the user
> process side, and the corresponding region of guest real address space
> follows that automatically.

yes. I suppose this should work also for 'ram device' memory mappings.

So when the passthrough device is added to the guest, we would add a 
new 'ram device' memory region for the device interrupt ESB pages 
that would overlap with the initial guest ESB pages.  

This is really a different approach.

>> which means that we need to provision an address space/container 
>> region for theses regions. What are the benefits ? 
>>
>> Is clearing the PTEs and repopulating the VMA unsafe ? 
> 
> Explicitly unmapping parts of the VMA seems like the wrong way to do
> it.  If you have a device mmap where the device wants to change the
> physical page underlying parts of the mapping, there should be a way
> for it to do that explicitly (but I don't know off the top of my head
> what the interface to do that is).
> 
> However, I still haven't seen a clear and concise explanation of what
> is being changed, when, and why we need to do that.

Yes. I agree on that. The problem is not very different from what we 
have today with the XICS-over-XIVE glue in KVM. Let me try to explain.


The KVM XICS-over-XIVE device and the proposed KVM XIVE native device 
implement an IRQ space for the guest using the generic IPI interrupts 
of the XIVE IC controller. These interrupts are allocated at the OPAL
level and "mapped" into the guest IRQ number space in the range 0-0x1FFF.
Interrupt management is performed in the XIVE way: using loads and 
stores on the addresses of the XIVE IPI interrupt ESB pages.

Both KVM devices share the same internal structure caching information 
on the interrupts, among which the xive_irq_data struct containing the 
addresses of the IPI ESB pages and an extra one in case of passthrough. 
The later contains the addresses of the ESB pages of the underlying HW 
controller interrupts, PHB4 in all cases for now.    

A guest when running in the XICS legacy interrupt mode lets the KVM 
XICS-over-XIVE device "handle" interrupt management, that is to perform  
the loads and stores on the addresses of the ESB pages of the guest 
interrupts. 

However, when running in XIVE native exploitation mode, the KVM XIVE 
native device exposes the interrupt ESB pages to the guest and lets 
the guest perform directly the loads and stores. 

The VMA exposing the ESB pages make use of a custom VM fault handler
which role is to populate the VMA with appropriate pages. When a fault
occurs, the guest IRQ number is deduced from the offset, and the ESB 
pages of associated XIVE IPI interrupt are inserted in the VMA (using
the internal structure caching information on the interrupts).

Supporting device passthrough in the guest running in XIVE native 
exploitation mode adds some extra refinements because the ESB pages 
of a different HW controller (PHB4) need to be exposed to the guest 
along with the initial IPI ESB pages of the XIVE IC controller. But
the overall mechanic is the same. 

When the device HW irqs are mapped into or unmapped from the guest
IRQ number space, the passthru_irq helpers, kvmppc_xive_set_mapped()
and kvmppc_xive_clr_mapped(), are called to record or clear the 
passthrough interrupt information and to perform the switch.

The approach taken by this patch is to clear the ESB pages of the 
guest IRQ number being mapped and let the VM fault handler repopulate. 
The handler will insert the ESB page corresponding to the HW interrupt 
of the device being passed-through or the initial IPI ESB page if the
device is being removed.   



Thanks,

C.
Cédric Le Goater Jan. 29, 2019, 5:44 p.m. UTC | #14
On 1/29/19 5:12 AM, Paul Mackerras wrote:
> On Mon, Jan 28, 2019 at 07:26:00PM +0100, Cédric Le Goater wrote:
>>
>> Is clearing the PTEs and repopulating the VMA unsafe ? 
> 
> Actually, now that I come to think of it, there could be any number of
> VMAs (well, up to almost 64k of them), since once you have a file
> descriptor you can call mmap on it multiple times.
> 
> The more I think about it, the more I think that getting userspace to
> manage the mappings with mmap() and munmap() is the right idea if it
> can be made to work.

We might be able to mmap() and munmap() regions of ESB pages in the RTAS 
call "ibm,change-msi".  I think that's the right spot for it. 

Thanks,

C.
Paul Mackerras Jan. 30, 2019, 5:55 a.m. UTC | #15
On Tue, Jan 29, 2019 at 06:44:50PM +0100, Cédric Le Goater wrote:
> On 1/29/19 5:12 AM, Paul Mackerras wrote:
> > On Mon, Jan 28, 2019 at 07:26:00PM +0100, Cédric Le Goater wrote:
> >>
> >> Is clearing the PTEs and repopulating the VMA unsafe ? 
> > 
> > Actually, now that I come to think of it, there could be any number of
> > VMAs (well, up to almost 64k of them), since once you have a file
> > descriptor you can call mmap on it multiple times.
> > 
> > The more I think about it, the more I think that getting userspace to
> > manage the mappings with mmap() and munmap() is the right idea if it
> > can be made to work.
> 
> We might be able to mmap() and munmap() regions of ESB pages in the RTAS 
> call "ibm,change-msi".  I think that's the right spot for it. 

I was thinking that the ESB pages should be mmapped for device
interrupts at VM startup or when a device is hot-plugged in, and
munmapped when the device is hot-removed.  Maybe the mmap could be
done in conjunction with the KVM_IRQFD call?

What is the reasoning behind doing it in the ibm,change-msi call?

Paul.
Paul Mackerras Jan. 30, 2019, 6:20 a.m. UTC | #16
On Tue, Jan 29, 2019 at 02:47:55PM +0100, Cédric Le Goater wrote:
> On 1/29/19 3:45 AM, Paul Mackerras wrote:
> > On Mon, Jan 28, 2019 at 07:26:00PM +0100, Cédric Le Goater wrote:
> >> On 1/28/19 7:13 AM, Paul Mackerras wrote:
> >>> Would we end up with too many VMAs if we just used mmap() to
> >>> change the mappings from the software-generated pages to the
> >>> hardware-generated interrupt pages?  
> >> The sPAPR IRQ number space is 0x8000 wide now. The first 4K are 
> >> dedicated to CPU IPIs and the remaining 4K are for devices. We can 
> > 
> > Confused.  You say the number space has 32768 entries but then imply
> > there are only 8K entries.  Do you mean that the API allows for 15-bit
> > IRQ numbers but we are only making using of 8192 of them?
> 
> Ouch. My bad. Let's do it again. 
> 
> The sPAPR IRQ number space is 0x2000 wide :
> 
> https://git.qemu.org/?p=qemu.git;a=blob;f=hw/ppc/spapr_irq.c;h=1da7a32348fced0bd638717022fc37a83fc5e279;hb=HEAD#l396
> 
> The first 4K are dedicated to the CPU IPIs and the remaining 4K are for 
> devices (which can be extended if needed).
> 
> So that's 8192 x 2 ESB pages.
> 
> >> extend the last range if needed as these are for MSIs. Dynamic 
> >> extensions under KVM should work also.
> >>
> >> This to say that we have with 8K x 2 (trigger+EOI) pages. This is a
> >> lot of mmap(), too much. Also the KVM model needs to be compatible
> > 
> > I wasn't suggesting an mmap per IRQ, I meant that the bulk of the
> > space would be covered by a single mmap, overlaid by subsequent mmaps
> > where we need to map real device interrupts.
> 
> ok. The same fault handler could be used to populate the VMA with the 
> ESB pages. 
> 
> But it would mean extra work on the QEMU side, which is not needed 
> with this patch. 

Maybe, but just storing a single vma pointer in our private data is
not a feasible approach.  First, you have no control on the lifetime
of the vma and thus this is a use-after-free waiting to happen, and
secondly, there could be multiple vmas that you need to worry about.
Userspace could do multiple mmaps, or could do mprotect or similar on
part of an existing mmap, which would split the vma for the mmap into
multiple vmas.  You don't get notified about munmap either as far as I
can tell, so the vma is liable to go away.  And it doesn't matter if
QEMU would never do such things; if userspace can provoke a
use-after-free in the kernel using KVM then someone will write a
specially crafted user program to do that.

So we either solve it in userspace, or we have to write and maintain
complex kernel code with deep links into the MM subsystem.  I'd much
rather solve it in userspace.

> >> with the QEMU emulated one and it was simpler to have one overall
> >> memory region for the IPI ESBs, one for the END ESBs (if we support
> >> that one day) and one for the TIMA.
> >>
> >>> Are the necessary pages for a PCI
> >>> passthrough device contiguous in both host real space 
> >>
> >> They should as they are the PHB4 ESBs.
> >>
> >>> and guest real space ? 
> >>
> >> also. That's how we organized the mapping. 
> > 
> > "How we organized the mapping" is a significant design decision that I
> > haven't seen documented anywhere, and is really needed for
> > understanding what's going on.
> 
> OK. I will add comments on that. See below for some description.
> 
> There is nothing fancy, it's simply indexed with the interrupt number,
> like for HW, or for the QEMU XIVE emulated model.
> 
> >>> If so we'd only need one mmap() for all the device's interrupt
> >>> pages.
> >>
> >> Ah. So we would want to make a special case for the passthrough 
> >> device and have a mmap() and a memory region for its ESBs. Hmm.
> >>
> >> Wouldn't that require to hot plug a memory region under the guest ? 
> > 
> > No; the way that a memory region works is that userspace can do
> > whatever disparate mappings it likes within the region on the user
> > process side, and the corresponding region of guest real address space
> > follows that automatically.
> 
> yes. I suppose this should work also for 'ram device' memory mappings.
> 
> So when the passthrough device is added to the guest, we would add a 
> new 'ram device' memory region for the device interrupt ESB pages 
> that would overlap with the initial guest ESB pages.  

Not knowing the QEMU internals all that well, I don't at all
understand why a new ram device is necesssary.  I would see it as a
single virtual area mapping the ESB pages of guest hwirqs that are in
use, and we manage those mappings with mmap and munmap.

> This is really a different approach.
> 
> >> which means that we need to provision an address space/container 
> >> region for theses regions. What are the benefits ? 
> >>
> >> Is clearing the PTEs and repopulating the VMA unsafe ? 
> > 
> > Explicitly unmapping parts of the VMA seems like the wrong way to do
> > it.  If you have a device mmap where the device wants to change the
> > physical page underlying parts of the mapping, there should be a way
> > for it to do that explicitly (but I don't know off the top of my head
> > what the interface to do that is).
> > 
> > However, I still haven't seen a clear and concise explanation of what
> > is being changed, when, and why we need to do that.
> 
> Yes. I agree on that. The problem is not very different from what we 
> have today with the XICS-over-XIVE glue in KVM. Let me try to explain.
> 
> 
> The KVM XICS-over-XIVE device and the proposed KVM XIVE native device 
> implement an IRQ space for the guest using the generic IPI interrupts 
> of the XIVE IC controller. These interrupts are allocated at the OPAL
> level and "mapped" into the guest IRQ number space in the range 0-0x1FFF.
> Interrupt management is performed in the XIVE way: using loads and 
> stores on the addresses of the XIVE IPI interrupt ESB pages.
> 
> Both KVM devices share the same internal structure caching information 
> on the interrupts, among which the xive_irq_data struct containing the 
> addresses of the IPI ESB pages and an extra one in case of passthrough. 
> The later contains the addresses of the ESB pages of the underlying HW 
> controller interrupts, PHB4 in all cases for now.    
> 
> A guest when running in the XICS legacy interrupt mode lets the KVM 
> XICS-over-XIVE device "handle" interrupt management, that is to perform  
> the loads and stores on the addresses of the ESB pages of the guest 
> interrupts. 
> 
> However, when running in XIVE native exploitation mode, the KVM XIVE 
> native device exposes the interrupt ESB pages to the guest and lets 
> the guest perform directly the loads and stores. 
> 
> The VMA exposing the ESB pages make use of a custom VM fault handler
> which role is to populate the VMA with appropriate pages. When a fault
> occurs, the guest IRQ number is deduced from the offset, and the ESB 
> pages of associated XIVE IPI interrupt are inserted in the VMA (using
> the internal structure caching information on the interrupts).
> 
> Supporting device passthrough in the guest running in XIVE native 
> exploitation mode adds some extra refinements because the ESB pages 
> of a different HW controller (PHB4) need to be exposed to the guest 
> along with the initial IPI ESB pages of the XIVE IC controller. But
> the overall mechanic is the same. 
> 
> When the device HW irqs are mapped into or unmapped from the guest
> IRQ number space, the passthru_irq helpers, kvmppc_xive_set_mapped()
> and kvmppc_xive_clr_mapped(), are called to record or clear the 
> passthrough interrupt information and to perform the switch.
> 
> The approach taken by this patch is to clear the ESB pages of the 
> guest IRQ number being mapped and let the VM fault handler repopulate. 
> The handler will insert the ESB page corresponding to the HW interrupt 
> of the device being passed-through or the initial IPI ESB page if the
> device is being removed.   

That's a much better write-up.  Thanks.

Paul.
Cédric Le Goater Jan. 30, 2019, 7:06 a.m. UTC | #17
On 1/30/19 6:55 AM, Paul Mackerras wrote:
> On Tue, Jan 29, 2019 at 06:44:50PM +0100, Cédric Le Goater wrote:
>> On 1/29/19 5:12 AM, Paul Mackerras wrote:
>>> On Mon, Jan 28, 2019 at 07:26:00PM +0100, Cédric Le Goater wrote:
>>>>
>>>> Is clearing the PTEs and repopulating the VMA unsafe ? 
>>>
>>> Actually, now that I come to think of it, there could be any number of
>>> VMAs (well, up to almost 64k of them), since once you have a file
>>> descriptor you can call mmap on it multiple times.
>>>
>>> The more I think about it, the more I think that getting userspace to
>>> manage the mappings with mmap() and munmap() is the right idea if it
>>> can be made to work.
>>
>> We might be able to mmap() and munmap() regions of ESB pages in the RTAS 
>> call "ibm,change-msi".  I think that's the right spot for it. 
> 
> I was thinking that the ESB pages should be mmapped for device
> interrupts at VM startup or when a device is hot-plugged in, and
> munmapped when the device is hot-removed.  Maybe the mmap could be
> done in conjunction with the KVM_IRQFD call?
> 
> What is the reasoning behind doing it in the ibm,change-msi call?

Because when the device is plugged in, it has no interrupts. These are 
allocated on demand only when the driver is loaded in the guest. 

C.
Cédric Le Goater Jan. 30, 2019, 3:54 p.m. UTC | #18
On 1/30/19 7:20 AM, Paul Mackerras wrote:
> On Tue, Jan 29, 2019 at 02:47:55PM +0100, Cédric Le Goater wrote:
>> On 1/29/19 3:45 AM, Paul Mackerras wrote:
>>> On Mon, Jan 28, 2019 at 07:26:00PM +0100, Cédric Le Goater wrote:
>>>> On 1/28/19 7:13 AM, Paul Mackerras wrote:
>>>>> Would we end up with too many VMAs if we just used mmap() to
>>>>> change the mappings from the software-generated pages to the
>>>>> hardware-generated interrupt pages?  
>>>> The sPAPR IRQ number space is 0x8000 wide now. The first 4K are 
>>>> dedicated to CPU IPIs and the remaining 4K are for devices. We can 
>>>
>>> Confused.  You say the number space has 32768 entries but then imply
>>> there are only 8K entries.  Do you mean that the API allows for 15-bit
>>> IRQ numbers but we are only making using of 8192 of them?
>>
>> Ouch. My bad. Let's do it again. 
>>
>> The sPAPR IRQ number space is 0x2000 wide :
>>
>> https://git.qemu.org/?p=qemu.git;a=blob;f=hw/ppc/spapr_irq.c;h=1da7a32348fced0bd638717022fc37a83fc5e279;hb=HEAD#l396
>>
>> The first 4K are dedicated to the CPU IPIs and the remaining 4K are for 
>> devices (which can be extended if needed).
>>
>> So that's 8192 x 2 ESB pages.
>>
>>>> extend the last range if needed as these are for MSIs. Dynamic 
>>>> extensions under KVM should work also.
>>>>
>>>> This to say that we have with 8K x 2 (trigger+EOI) pages. This is a
>>>> lot of mmap(), too much. Also the KVM model needs to be compatible
>>>
>>> I wasn't suggesting an mmap per IRQ, I meant that the bulk of the
>>> space would be covered by a single mmap, overlaid by subsequent mmaps
>>> where we need to map real device interrupts.
>>
>> ok. The same fault handler could be used to populate the VMA with the 
>> ESB pages. 
>>
>> But it would mean extra work on the QEMU side, which is not needed 
>> with this patch. 
> 
> Maybe, but just storing a single vma pointer in our private data is
> not a feasible approach.  First, you have no control on the lifetime
> of the vma and thus this is a use-after-free waiting to happen, and
> secondly, there could be multiple vmas that you need to worry about.

I fully agree. That's why I was uncomfortable with the solution. There 
are a few other drivers (GPUs if I recall) doing that but it feels wrong.

> Userspace could do multiple mmaps, or could do mprotect or similar on
> part of an existing mmap, which would split the vma for the mmap into
> multiple vmas.  You don't get notified about munmap either as far as I
> can tell, so the vma is liable to go away.  

yes ...

> And it doesn't matter if
> QEMU would never do such things; if userspace can provoke a
> use-after-free in the kernel using KVM then someone will write a
> specially crafted user program to do that.
> 
> So we either solve it in userspace, or we have to write and maintain
> complex kernel code with deep links into the MM subsystem.  
>
> I'd much rather solve it in userspace.

OK, then. I have been reluctant doing so but it seems there are no
other in-kernel solution. 

>>>> with the QEMU emulated one and it was simpler to have one overall
>>>> memory region for the IPI ESBs, one for the END ESBs (if we support
>>>> that one day) and one for the TIMA.
>>>>
>>>>> Are the necessary pages for a PCI
>>>>> passthrough device contiguous in both host real space 
>>>>
>>>> They should as they are the PHB4 ESBs.
>>>>
>>>>> and guest real space ? 
>>>>
>>>> also. That's how we organized the mapping. 
>>>
>>> "How we organized the mapping" is a significant design decision that I
>>> haven't seen documented anywhere, and is really needed for
>>> understanding what's going on.
>>
>> OK. I will add comments on that. See below for some description.
>>
>> There is nothing fancy, it's simply indexed with the interrupt number,
>> like for HW, or for the QEMU XIVE emulated model.
>>
>>>>> If so we'd only need one mmap() for all the device's interrupt
>>>>> pages.
>>>>
>>>> Ah. So we would want to make a special case for the passthrough 
>>>> device and have a mmap() and a memory region for its ESBs. Hmm.
>>>>
>>>> Wouldn't that require to hot plug a memory region under the guest ? 
>>>
>>> No; the way that a memory region works is that userspace can do
>>> whatever disparate mappings it likes within the region on the user
>>> process side, and the corresponding region of guest real address space
>>> follows that automatically.
>>
>> yes. I suppose this should work also for 'ram device' memory mappings.
>>
>> So when the passthrough device is added to the guest, we would add a 
>> new 'ram device' memory region for the device interrupt ESB pages 
>> that would overlap with the initial guest ESB pages.  
> 
> Not knowing the QEMU internals all that well, I don't at all
> understand why a new ram device is necesssary. 

'ram device' memory regions are of a special type which is used to 
directly map into the guest the result of a mmap() in QEMU. 

This is how we propagate the XIVE ESB pages from HW (and the TIMA) 
to the guest and the Linux kernel. It has other use with VFIO.   

> I would see it as a
> single virtual area mapping the ESB pages of guest hwirqs that are in
> use, and we manage those mappings with mmap and munmap.

Yes I think I understand the idea. I will give a try. I need to find the 
right place to do so in QEMU. See other email thread.

>> This is really a different approach.
>>
>>>> which means that we need to provision an address space/container 
>>>> region for theses regions. What are the benefits ? 
>>>>
>>>> Is clearing the PTEs and repopulating the VMA unsafe ? 
>>>
>>> Explicitly unmapping parts of the VMA seems like the wrong way to do
>>> it.  If you have a device mmap where the device wants to change the
>>> physical page underlying parts of the mapping, there should be a way
>>> for it to do that explicitly (but I don't know off the top of my head
>>> what the interface to do that is).
>>>
>>> However, I still haven't seen a clear and concise explanation of what
>>> is being changed, when, and why we need to do that.
>>
>> Yes. I agree on that. The problem is not very different from what we 
>> have today with the XICS-over-XIVE glue in KVM. Let me try to explain.
>>
>>
>> The KVM XICS-over-XIVE device and the proposed KVM XIVE native device 
>> implement an IRQ space for the guest using the generic IPI interrupts 
>> of the XIVE IC controller. These interrupts are allocated at the OPAL
>> level and "mapped" into the guest IRQ number space in the range 0-0x1FFF.
>> Interrupt management is performed in the XIVE way: using loads and 
>> stores on the addresses of the XIVE IPI interrupt ESB pages.
>>
>> Both KVM devices share the same internal structure caching information 
>> on the interrupts, among which the xive_irq_data struct containing the 
>> addresses of the IPI ESB pages and an extra one in case of passthrough. 
>> The later contains the addresses of the ESB pages of the underlying HW 
>> controller interrupts, PHB4 in all cases for now.    
>>
>> A guest when running in the XICS legacy interrupt mode lets the KVM 
>> XICS-over-XIVE device "handle" interrupt management, that is to perform  
>> the loads and stores on the addresses of the ESB pages of the guest 
>> interrupts. 
>>
>> However, when running in XIVE native exploitation mode, the KVM XIVE 
>> native device exposes the interrupt ESB pages to the guest and lets 
>> the guest perform directly the loads and stores. 
>>
>> The VMA exposing the ESB pages make use of a custom VM fault handler
>> which role is to populate the VMA with appropriate pages. When a fault
>> occurs, the guest IRQ number is deduced from the offset, and the ESB 
>> pages of associated XIVE IPI interrupt are inserted in the VMA (using
>> the internal structure caching information on the interrupts).
>>
>> Supporting device passthrough in the guest running in XIVE native 
>> exploitation mode adds some extra refinements because the ESB pages 
>> of a different HW controller (PHB4) need to be exposed to the guest 
>> along with the initial IPI ESB pages of the XIVE IC controller. But
>> the overall mechanic is the same. 
>>
>> When the device HW irqs are mapped into or unmapped from the guest
>> IRQ number space, the passthru_irq helpers, kvmppc_xive_set_mapped()
>> and kvmppc_xive_clr_mapped(), are called to record or clear the 
>> passthrough interrupt information and to perform the switch.
>>
>> The approach taken by this patch is to clear the ESB pages of the 
>> guest IRQ number being mapped and let the VM fault handler repopulate. 
>> The handler will insert the ESB page corresponding to the HW interrupt 
>> of the device being passed-through or the initial IPI ESB page if the
>> device is being removed.   
> 
> That's a much better write-up.  Thanks.

OK. I will reuse it when time comes.

Thanks,

C.
Paul Mackerras Jan. 31, 2019, 2:48 a.m. UTC | #19
On Wed, Jan 30, 2019 at 04:54:23PM +0100, Cédric Le Goater wrote:
> On 1/30/19 7:20 AM, Paul Mackerras wrote:
> > On Tue, Jan 29, 2019 at 02:47:55PM +0100, Cédric Le Goater wrote:
> >> On 1/29/19 3:45 AM, Paul Mackerras wrote:
> >>> On Mon, Jan 28, 2019 at 07:26:00PM +0100, Cédric Le Goater wrote:
> >>>> On 1/28/19 7:13 AM, Paul Mackerras wrote:
> >>>>> Would we end up with too many VMAs if we just used mmap() to
> >>>>> change the mappings from the software-generated pages to the
> >>>>> hardware-generated interrupt pages?  
> >>>> The sPAPR IRQ number space is 0x8000 wide now. The first 4K are 
> >>>> dedicated to CPU IPIs and the remaining 4K are for devices. We can 
> >>>
> >>> Confused.  You say the number space has 32768 entries but then imply
> >>> there are only 8K entries.  Do you mean that the API allows for 15-bit
> >>> IRQ numbers but we are only making using of 8192 of them?
> >>
> >> Ouch. My bad. Let's do it again. 
> >>
> >> The sPAPR IRQ number space is 0x2000 wide :
> >>
> >> https://git.qemu.org/?p=qemu.git;a=blob;f=hw/ppc/spapr_irq.c;h=1da7a32348fced0bd638717022fc37a83fc5e279;hb=HEAD#l396
> >>
> >> The first 4K are dedicated to the CPU IPIs and the remaining 4K are for 
> >> devices (which can be extended if needed).
> >>
> >> So that's 8192 x 2 ESB pages.
> >>
> >>>> extend the last range if needed as these are for MSIs. Dynamic 
> >>>> extensions under KVM should work also.
> >>>>
> >>>> This to say that we have with 8K x 2 (trigger+EOI) pages. This is a
> >>>> lot of mmap(), too much. Also the KVM model needs to be compatible
> >>>
> >>> I wasn't suggesting an mmap per IRQ, I meant that the bulk of the
> >>> space would be covered by a single mmap, overlaid by subsequent mmaps
> >>> where we need to map real device interrupts.
> >>
> >> ok. The same fault handler could be used to populate the VMA with the 
> >> ESB pages. 
> >>
> >> But it would mean extra work on the QEMU side, which is not needed 
> >> with this patch. 
> > 
> > Maybe, but just storing a single vma pointer in our private data is
> > not a feasible approach.  First, you have no control on the lifetime
> > of the vma and thus this is a use-after-free waiting to happen, and
> > secondly, there could be multiple vmas that you need to worry about.
> 
> I fully agree. That's why I was uncomfortable with the solution. There 
> are a few other drivers (GPUs if I recall) doing that but it feels wrong.

There is the HMM infrastructure (heterogeneous memory model) which
could possibly be used to do this, but it's very heavyweight for the
problem we have here.

> > Userspace could do multiple mmaps, or could do mprotect or similar on
> > part of an existing mmap, which would split the vma for the mmap into
> > multiple vmas.  You don't get notified about munmap either as far as I
> > can tell, so the vma is liable to go away.  
> 
> yes ...
> 
> > And it doesn't matter if
> > QEMU would never do such things; if userspace can provoke a
> > use-after-free in the kernel using KVM then someone will write a
> > specially crafted user program to do that.
> > 
> > So we either solve it in userspace, or we have to write and maintain
> > complex kernel code with deep links into the MM subsystem.  
> >
> > I'd much rather solve it in userspace.
> 
> OK, then. I have been reluctant doing so but it seems there are no
> other in-kernel solution. 

I discussed the problem today with David Gibson and he mentioned that
QEMU does have a lot of freedom in how it assigns the guest hwirq
numbers.  So you may be able to avoid the problem by (for example)
never assigning a hwirq to a VFIO device that has previously been used
for an emulated device (or something like that).

> >>>> with the QEMU emulated one and it was simpler to have one overall
> >>>> memory region for the IPI ESBs, one for the END ESBs (if we support
> >>>> that one day) and one for the TIMA.
> >>>>
> >>>>> Are the necessary pages for a PCI
> >>>>> passthrough device contiguous in both host real space 
> >>>>
> >>>> They should as they are the PHB4 ESBs.
> >>>>
> >>>>> and guest real space ? 
> >>>>
> >>>> also. That's how we organized the mapping. 
> >>>
> >>> "How we organized the mapping" is a significant design decision that I
> >>> haven't seen documented anywhere, and is really needed for
> >>> understanding what's going on.
> >>
> >> OK. I will add comments on that. See below for some description.
> >>
> >> There is nothing fancy, it's simply indexed with the interrupt number,
> >> like for HW, or for the QEMU XIVE emulated model.
> >>
> >>>>> If so we'd only need one mmap() for all the device's interrupt
> >>>>> pages.
> >>>>
> >>>> Ah. So we would want to make a special case for the passthrough 
> >>>> device and have a mmap() and a memory region for its ESBs. Hmm.
> >>>>
> >>>> Wouldn't that require to hot plug a memory region under the guest ? 
> >>>
> >>> No; the way that a memory region works is that userspace can do
> >>> whatever disparate mappings it likes within the region on the user
> >>> process side, and the corresponding region of guest real address space
> >>> follows that automatically.
> >>
> >> yes. I suppose this should work also for 'ram device' memory mappings.
> >>
> >> So when the passthrough device is added to the guest, we would add a 
> >> new 'ram device' memory region for the device interrupt ESB pages 
> >> that would overlap with the initial guest ESB pages.  
> > 
> > Not knowing the QEMU internals all that well, I don't at all
> > understand why a new ram device is necesssary. 
> 
> 'ram device' memory regions are of a special type which is used to 
> directly map into the guest the result of a mmap() in QEMU. 
> 
> This is how we propagate the XIVE ESB pages from HW (and the TIMA) 
> to the guest and the Linux kernel. It has other use with VFIO.   
> 
> > I would see it as a
> > single virtual area mapping the ESB pages of guest hwirqs that are in
> > use, and we manage those mappings with mmap and munmap.
> 
> Yes I think I understand the idea. I will give a try. I need to find the 
> right place to do so in QEMU. See other email thread.
> 
> >> This is really a different approach.
> >>
> >>>> which means that we need to provision an address space/container 
> >>>> region for theses regions. What are the benefits ? 
> >>>>
> >>>> Is clearing the PTEs and repopulating the VMA unsafe ? 
> >>>
> >>> Explicitly unmapping parts of the VMA seems like the wrong way to do
> >>> it.  If you have a device mmap where the device wants to change the
> >>> physical page underlying parts of the mapping, there should be a way
> >>> for it to do that explicitly (but I don't know off the top of my head
> >>> what the interface to do that is).
> >>>
> >>> However, I still haven't seen a clear and concise explanation of what
> >>> is being changed, when, and why we need to do that.
> >>
> >> Yes. I agree on that. The problem is not very different from what we 
> >> have today with the XICS-over-XIVE glue in KVM. Let me try to explain.
> >>
> >>
> >> The KVM XICS-over-XIVE device and the proposed KVM XIVE native device 
> >> implement an IRQ space for the guest using the generic IPI interrupts 
> >> of the XIVE IC controller. These interrupts are allocated at the OPAL
> >> level and "mapped" into the guest IRQ number space in the range 0-0x1FFF.
> >> Interrupt management is performed in the XIVE way: using loads and 
> >> stores on the addresses of the XIVE IPI interrupt ESB pages.
> >>
> >> Both KVM devices share the same internal structure caching information 
> >> on the interrupts, among which the xive_irq_data struct containing the 
> >> addresses of the IPI ESB pages and an extra one in case of passthrough. 
> >> The later contains the addresses of the ESB pages of the underlying HW 
> >> controller interrupts, PHB4 in all cases for now.    
> >>
> >> A guest when running in the XICS legacy interrupt mode lets the KVM 
> >> XICS-over-XIVE device "handle" interrupt management, that is to perform  
> >> the loads and stores on the addresses of the ESB pages of the guest 
> >> interrupts. 
> >>
> >> However, when running in XIVE native exploitation mode, the KVM XIVE 
> >> native device exposes the interrupt ESB pages to the guest and lets 
> >> the guest perform directly the loads and stores. 
> >>
> >> The VMA exposing the ESB pages make use of a custom VM fault handler
> >> which role is to populate the VMA with appropriate pages. When a fault
> >> occurs, the guest IRQ number is deduced from the offset, and the ESB 
> >> pages of associated XIVE IPI interrupt are inserted in the VMA (using
> >> the internal structure caching information on the interrupts).
> >>
> >> Supporting device passthrough in the guest running in XIVE native 
> >> exploitation mode adds some extra refinements because the ESB pages 
> >> of a different HW controller (PHB4) need to be exposed to the guest 
> >> along with the initial IPI ESB pages of the XIVE IC controller. But
> >> the overall mechanic is the same. 
> >>
> >> When the device HW irqs are mapped into or unmapped from the guest
> >> IRQ number space, the passthru_irq helpers, kvmppc_xive_set_mapped()
> >> and kvmppc_xive_clr_mapped(), are called to record or clear the 
> >> passthrough interrupt information and to perform the switch.
> >>
> >> The approach taken by this patch is to clear the ESB pages of the 
> >> guest IRQ number being mapped and let the VM fault handler repopulate. 
> >> The handler will insert the ESB page corresponding to the HW interrupt 
> >> of the device being passed-through or the initial IPI ESB page if the
> >> device is being removed.   
> > 
> > That's a much better write-up.  Thanks.
> 
> OK. I will reuse it when time comes.
> 
> Thanks,
> 
> C. 

Paul.

Patch
diff mbox series

diff --git a/arch/powerpc/kvm/book3s_xive.h b/arch/powerpc/kvm/book3s_xive.h
index 31e598e62589..6e64d3496a2c 100644
--- a/arch/powerpc/kvm/book3s_xive.h
+++ b/arch/powerpc/kvm/book3s_xive.h
@@ -90,6 +90,11 @@  struct kvmppc_xive_src_block {
 	struct kvmppc_xive_irq_state irq_state[KVMPPC_XICS_IRQ_PER_ICS];
 };
 
+struct kvmppc_xive;
+
+struct kvmppc_xive_ops {
+	int (*reset_mapped)(struct kvm *kvm, unsigned long guest_irq);
+};
 
 struct kvmppc_xive {
 	struct kvm *kvm;
@@ -131,6 +136,9 @@  struct kvmppc_xive {
 
 	/* VC base address for ESBs */
 	u64     vc_base;
+
+	struct kvmppc_xive_ops *ops;
+	struct vm_area_struct *vma;
 };
 
 #define KVMPPC_XIVE_Q_COUNT	8
diff --git a/arch/powerpc/kvm/book3s_xive.c b/arch/powerpc/kvm/book3s_xive.c
index e9f05d9c9ad5..9b4751713554 100644
--- a/arch/powerpc/kvm/book3s_xive.c
+++ b/arch/powerpc/kvm/book3s_xive.c
@@ -946,6 +946,13 @@  int kvmppc_xive_set_mapped(struct kvm *kvm, unsigned long guest_irq,
 	/* Turn the IPI hard off */
 	xive_vm_esb_load(&state->ipi_data, XIVE_ESB_SET_PQ_01);
 
+	/*
+	 * Reset ESB guest mapping. Needed when ESB pages are exposed
+	 * to the guest in XIVE native mode
+	 */
+	if (xive->ops && xive->ops->reset_mapped)
+		xive->ops->reset_mapped(kvm, guest_irq);
+
 	/* Grab info about irq */
 	state->pt_number = hw_irq;
 	state->pt_data = irq_data_get_irq_handler_data(host_data);
@@ -1031,6 +1038,14 @@  int kvmppc_xive_clr_mapped(struct kvm *kvm, unsigned long guest_irq,
 	state->pt_number = 0;
 	state->pt_data = NULL;
 
+	/*
+	 * Reset ESB guest mapping. Needed when ESB pages are exposed
+	 * to the guest in XIVE native mode
+	 */
+	if (xive->ops && xive->ops->reset_mapped) {
+		xive->ops->reset_mapped(kvm, guest_irq);
+	}
+
 	/* Reconfigure the IPI */
 	xive_native_configure_irq(state->ipi_number,
 				  xive_vp(xive, state->act_server),
diff --git a/arch/powerpc/kvm/book3s_xive_native.c b/arch/powerpc/kvm/book3s_xive_native.c
index 1aefb366df0b..12edac29995e 100644
--- a/arch/powerpc/kvm/book3s_xive_native.c
+++ b/arch/powerpc/kvm/book3s_xive_native.c
@@ -240,6 +240,32 @@  static int kvmppc_xive_native_get_vc_base(struct kvmppc_xive *xive, u64 addr)
 	return 0;
 }
 
+static int kvmppc_xive_native_reset_mapped(struct kvm *kvm, unsigned long irq)
+{
+	struct kvmppc_xive *xive = kvm->arch.xive;
+	struct mm_struct *mm = kvm->mm;
+	struct vm_area_struct *vma = xive->vma;
+	unsigned long address;
+
+	if (irq >= KVMPPC_XIVE_NR_IRQS)
+		return -EINVAL;
+
+	pr_debug("clearing esb pages for girq 0x%lx\n", irq);
+
+	down_read(&mm->mmap_sem);
+	/* TODO: can we clear the PTEs without keeping a VMA pointer ? */
+	if (vma) {
+		address = vma->vm_start + irq * (2ull << PAGE_SHIFT);
+		zap_vma_ptes(vma, address, 2ull << PAGE_SHIFT);
+	}
+	up_read(&mm->mmap_sem);
+	return 0;
+}
+
+static struct kvmppc_xive_ops kvmppc_xive_native_ops =  {
+	.reset_mapped = kvmppc_xive_native_reset_mapped,
+};
+
 static int xive_native_esb_fault(struct vm_fault *vmf)
 {
 	struct vm_area_struct *vma = vmf->vma;
@@ -292,6 +318,8 @@  static const struct vm_operations_struct xive_native_esb_vmops = {
 
 static int xive_native_esb_mmap(struct file *file, struct vm_area_struct *vma)
 {
+	struct kvmppc_xive *xive = vma->vm_file->private_data;
+
 	/* There are two ESB pages (trigger and EOI) per IRQ */
 	if (vma_pages(vma) + vma->vm_pgoff > KVMPPC_XIVE_NR_IRQS * 2)
 		return -EINVAL;
@@ -299,6 +327,7 @@  static int xive_native_esb_mmap(struct file *file, struct vm_area_struct *vma)
 	vma->vm_flags |= VM_IO | VM_PFNMAP;
 	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
 	vma->vm_ops = &xive_native_esb_vmops;
+	xive->vma = vma; /* TODO: get rid of the VMA pointer */
 	return 0;
 }
 
@@ -992,6 +1021,7 @@  static int kvmppc_xive_native_create(struct kvm_device *dev, u32 type)
 	xive->vc_base = XIVE_VC_BASE;
 
 	xive->single_escalation = xive_native_has_single_escalation();
+	xive->ops = &kvmppc_xive_native_ops;
 
 	if (ret)
 		kfree(xive);