[00/19] KVM: PPC: Book3S HV: add XIVE native exploitation mode

Message ID 20190107184331.8429-1-clg@kaod.org

Message

Cédric Le Goater Jan. 7, 2019, 6:43 p.m. UTC
Hello,

On the POWER9 processor, the XIVE interrupt controller can control
interrupt sources using MMIO to trigger events, to EOI or to turn off
the sources. Priority management and interrupt acknowledgment are also
controlled by MMIO in the CPU presenter subengine.

PowerNV/baremetal Linux runs natively under XIVE but sPAPR guests need
special support from the hypervisor to do the same. This is called the
XIVE native exploitation mode and today, it can be activated under the
PowerPC Hypervisor, pHyp. However, Linux/KVM lacks XIVE native support
and still offers the old interrupt mode interface using a
XICS-over-XIVE glue which implements the XICS hcalls.

The following series is a proposal to add the same support under KVM.

A new KVM device is introduced for the XIVE native exploitation
mode. It reuses most of the XICS-over-XIVE glue implementation
structures which are internal to KVM but has a completely different
interface. A set of Hypervisor calls configures the sources and the
event queues and from there, all control is done by the guest through
MMIOs.

These MMIO regions (ESB and TIMA) are exposed to guests in QEMU,
similarly to VFIO, and the associated VMAs are populated dynamically
with the appropriate pages using a fault handler. This is implemented
with a couple of KVM device ioctls.

On a POWER9 sPAPR machine, the Client Architecture Support (CAS)
negotiation process determines whether the guest operates with an
interrupt controller using the XICS legacy model, as found on POWER8,
or in XIVE exploitation mode. This means that the KVM interrupt
device should be created at runtime, after the machine has started.
This requires extra KVM support to create/destroy KVM devices. The
last patches are an attempt to solve that problem.
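
To illustrate why the device has to be replaceable at runtime, here is a
minimal sketch in plain C. It is only a model: the names (irq_mode,
machine, machine_start, machine_apply_cas) are hypothetical and are not
taken from the patchset, and the assumption that the machine first comes
up with the XICS device is mine. A real implementation would go through
the KVM device create/destroy interfaces instead of updating an enum.

```c
/* Hypothetical model of the CAS-driven device switch; not code from the series. */
enum irq_mode { IRQ_MODE_NONE, IRQ_MODE_XICS, IRQ_MODE_XIVE };

struct machine {
    enum irq_mode kvm_device; /* which in-kernel interrupt device exists now */
    int creates;              /* number of device creations so far */
    int destroys;             /* number of device destructions so far */
};

/* Assumption: before CAS, the machine starts with the XICS legacy device. */
void machine_start(struct machine *m)
{
    m->kvm_device = IRQ_MODE_XICS;
    m->creates = 1;
    m->destroys = 0;
}

/*
 * Called once CAS has settled on an interrupt mode. If the guest
 * negotiated a different mode than the device currently in place, the
 * old device has to be destroyed and a new one created, which is the
 * extra KVM support the cover letter refers to.
 */
void machine_apply_cas(struct machine *m, enum irq_mode negotiated)
{
    if (m->kvm_device == negotiated)
        return; /* nothing to do, the right device already exists */

    if (m->kvm_device != IRQ_MODE_NONE)
        m->destroys++; /* tear down the device of the other mode */

    m->kvm_device = negotiated;
    m->creates++;
}
```

In this model, a guest that negotiates XIVE costs one destroy and one
extra create, while a guest that stays on XICS costs nothing.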

Migration has its own specific needs. The patchset provides the
necessary routines to quiesce XIVE, to capture and restore the state
of the different structures used by KVM, OPAL and HW. Extra OPAL
support is required for these.

GitHub trees are available here:
 
QEMU sPAPR:

  https://github.com/legoater/qemu/commits/xive-next
  
Linux/KVM:

  https://github.com/legoater/linux/commits/xive-5.0

OPAL:

  https://github.com/legoater/skiboot/commits/xive

Best wishes for 2019 !

C.


Cédric Le Goater (19):
  powerpc/xive: export flags for the XIVE native exploitation mode
    hcalls
  powerpc/xive: add OPAL extensions for the XIVE native exploitation
    support
  KVM: PPC: Book3S HV: check the IRQ controller type
  KVM: PPC: Book3S HV: export services for the XIVE native exploitation
    device
  KVM: PPC: Book3S HV: add a new KVM device for the XIVE native
    exploitation mode
  KVM: PPC: Book3S HV: add a GET_ESB_FD control to the XIVE native
    device
  KVM: PPC: Book3S HV: add a GET_TIMA_FD control to XIVE native device
  KVM: PPC: Book3S HV: add a VC_BASE control to the XIVE native device
  KVM: PPC: Book3S HV: add a SET_SOURCE control to the XIVE native
    device
  KVM: PPC: Book3S HV: add a EISN attribute to kvmppc_xive_irq_state
  KVM: PPC: Book3S HV: add support for the XIVE native exploitation mode
    hcalls
  KVM: PPC: Book3S HV: record guest queue page address
  KVM: PPC: Book3S HV: add a SYNC control for the XIVE native migration
  KVM: PPC: Book3S HV: add a control to make the XIVE EQ pages dirty
  KVM: PPC: Book3S HV: add get/set accessors for the source
    configuration
  KVM: PPC: Book3S HV: add get/set accessors for the EQ configuration
  KVM: PPC: Book3S HV: add get/set accessors for the VP XIVE state
  KVM: PPC: Book3S HV: add passthrough support
  KVM: introduce a KVM_DELETE_DEVICE ioctl

 arch/powerpc/include/asm/kvm_host.h           |    2 +
 arch/powerpc/include/asm/kvm_ppc.h            |   69 +
 arch/powerpc/include/asm/opal-api.h           |   11 +-
 arch/powerpc/include/asm/opal.h               |    7 +
 arch/powerpc/include/asm/xive.h               |   40 +
 arch/powerpc/include/uapi/asm/kvm.h           |   47 +
 arch/powerpc/kvm/book3s_xive.h                |   82 +
 include/linux/kvm_host.h                      |    2 +
 include/uapi/linux/kvm.h                      |    5 +
 arch/powerpc/kvm/book3s.c                     |   31 +-
 arch/powerpc/kvm/book3s_hv.c                  |   29 +
 arch/powerpc/kvm/book3s_hv_builtin.c          |  196 +++
 arch/powerpc/kvm/book3s_hv_rm_xive_native.c   |   47 +
 arch/powerpc/kvm/book3s_xive.c                |  149 +-
 arch/powerpc/kvm/book3s_xive_native.c         | 1406 +++++++++++++++++
 .../powerpc/kvm/book3s_xive_native_template.c |  398 +++++
 arch/powerpc/kvm/powerpc.c                    |   30 +
 arch/powerpc/sysdev/xive/native.c             |  110 ++
 arch/powerpc/sysdev/xive/spapr.c              |   28 +-
 virt/kvm/kvm_main.c                           |   39 +
 arch/powerpc/kvm/Makefile                     |    4 +-
 arch/powerpc/kvm/book3s_hv_rmhandlers.S       |   52 +
 .../powerpc/platforms/powernv/opal-wrappers.S |    3 +
 23 files changed, 2722 insertions(+), 65 deletions(-)
 create mode 100644 arch/powerpc/kvm/book3s_hv_rm_xive_native.c
 create mode 100644 arch/powerpc/kvm/book3s_xive_native.c
 create mode 100644 arch/powerpc/kvm/book3s_xive_native_template.c

Comments

Paul Mackerras Jan. 22, 2019, 4:46 a.m. UTC | #1
On Mon, Jan 07, 2019 at 07:43:12PM +0100, Cédric Le Goater wrote:
> Hello,
> 
> On the POWER9 processor, the XIVE interrupt controller can control
> interrupt sources using MMIO to trigger events, to EOI or to turn off
> the sources. Priority management and interrupt acknowledgment is also
> controlled by MMIO in the CPU presenter subengine.
> 
> PowerNV/baremetal Linux runs natively under XIVE but sPAPR guests need
> special support from the hypervisor to do the same. This is called the
> XIVE native exploitation mode and today, it can be activated under the
> PowerPC Hypervisor, pHyp. However, Linux/KVM lacks XIVE native support
> and still offers the old interrupt mode interface using a
> XICS-over-XIVE glue which implements the XICS hcalls.
> 
> The following series is proposal to add the same support under KVM.
> 
> A new KVM device is introduced for the XIVE native exploitation
> mode. It reuses most of the XICS-over-XIVE glue implementation
> structures which are internal to KVM but has a completely different
> interface. A set of Hypervisor calls configures the sources and the
> event queues and from there, all control is done by the guest through
> MMIOs.
> 
> These MMIO regions (ESB and TIMA) are exposed to guests in QEMU,
> similarly to VFIO, and the associated VMAs are populated dynamically
> with the appropriate pages using a fault handler. This is implemented
> with a couple of KVM device ioctls.
> 
> On a POWER9 sPAPR machine, the Client Architecture Support (CAS)
> negotiation process determines whether the guest operates with a
> interrupt controller using the XICS legacy model, as found on POWER8,
> or in XIVE exploitation mode. Which means that the KVM interrupt
> device should be created at runtime, after the machine as started.
> This requires extra KVM support to create/destroy KVM devices. The
> last patches are an attempt to solve that problem.
> 
> Migration has its own specific needs. The patchset provides the
> necessary routines to quiesce XIVE, to capture and restore the state
> of the different structures used by KVM, OPAL and HW. Extra OPAL
> support is required for these.

Thanks for the patchset.  It mostly looks good, but there are some
more things we need to consider, and I think a v2 will be needed.

One general comment I have is that there are a lot of acronyms in this
code and you mostly seem to assume that people will know what they all
mean.  It would make the code more readable if you provide the
expansion of the acronym on first use in a comment or whatever.  For
example, one of the patches in this series talks about the "EAS"
without ever expanding it in any comment or in the patch description,
and I have forgotten just at the moment what EAS stands for (I just
know that understanding the XIVE is not eas-y. :)

Another general comment is that you seem to have written all this
code assuming we are using HV KVM in a host running bare-metal.
However, we could be using PR KVM (either in a bare-metal host or in a
guest), or we could be doing nested HV KVM where we are using the
kvm_hv module inside a KVM guest and using special hypercalls for
controlling our guests.

It would be perfectly acceptable for now to say that we don't yet
support XIVE exploitation in those scenarios, as long as we then make
sure that the new KVM capability reports false in those scenarios, and
any attempt to use the XIVE exploitation interfaces fails cleanly.
I don't see that either of those is true in the patch set as it
stands, so that is one area that needs to be fixed.

A third general comment is that the new KVM interfaces you have added
need to be documented in the files under Documentation/virtual/kvm.

Paul.
Cédric Le Goater Jan. 23, 2019, 7:07 p.m. UTC | #2
On 1/22/19 5:46 AM, Paul Mackerras wrote:
> On Mon, Jan 07, 2019 at 07:43:12PM +0100, Cédric Le Goater wrote:
>> Hello,
>>
>> On the POWER9 processor, the XIVE interrupt controller can control
>> interrupt sources using MMIO to trigger events, to EOI or to turn off
>> the sources. Priority management and interrupt acknowledgment is also
>> controlled by MMIO in the CPU presenter subengine.
>>
>> PowerNV/baremetal Linux runs natively under XIVE but sPAPR guests need
>> special support from the hypervisor to do the same. This is called the
>> XIVE native exploitation mode and today, it can be activated under the
>> PowerPC Hypervisor, pHyp. However, Linux/KVM lacks XIVE native support
>> and still offers the old interrupt mode interface using a
>> XICS-over-XIVE glue which implements the XICS hcalls.
>>
>> The following series is proposal to add the same support under KVM.
>>
>> A new KVM device is introduced for the XIVE native exploitation
>> mode. It reuses most of the XICS-over-XIVE glue implementation
>> structures which are internal to KVM but has a completely different
>> interface. A set of Hypervisor calls configures the sources and the
>> event queues and from there, all control is done by the guest through
>> MMIOs.
>>
>> These MMIO regions (ESB and TIMA) are exposed to guests in QEMU,
>> similarly to VFIO, and the associated VMAs are populated dynamically
>> with the appropriate pages using a fault handler. This is implemented
>> with a couple of KVM device ioctls.
>>
>> On a POWER9 sPAPR machine, the Client Architecture Support (CAS)
>> negotiation process determines whether the guest operates with a
>> interrupt controller using the XICS legacy model, as found on POWER8,
>> or in XIVE exploitation mode. Which means that the KVM interrupt
>> device should be created at runtime, after the machine as started.
>> This requires extra KVM support to create/destroy KVM devices. The
>> last patches are an attempt to solve that problem.
>>
>> Migration has its own specific needs. The patchset provides the
>> necessary routines to quiesce XIVE, to capture and restore the state
>> of the different structures used by KVM, OPAL and HW. Extra OPAL
>> support is required for these.
> 
> Thanks for the patchset.  It mostly looks good, but there are some
> more things we need to consider, and I think a v2 will be needed.
> 
> One general comment I have is that there are a lot of acronyms in this
> code and you mostly seem to assume that people will know what they all
> mean.  It would make the code more readable if you provide the
> expansion of the acronym on first use in a comment or whatever.  For
> example, one of the patches in this series talks about the "EAS"

 Event Assignment Structure, a.k.a IVE (Interrupt Virtualization Entry)

All the names changed somewhere between XIVE v1 and XIVE v2. OPAL and
Linux should be adjusted ...

> without ever expanding it in any comment or in the patch description,
> and I have forgotten just at the moment what EAS stands for (I just
> know that understanding the XIVE is not eas-y. :)
ah ! yes. But we have great documentation :)

We pushed some high level description of XIVE in QEMU :

  https://git.qemu.org/?p=qemu.git;a=blob;f=include/hw/ppc/xive.h;h=ec23253ba448e25c621356b55a7777119a738f8e;hb=HEAD

I should do the same for Linux with a KVM section to explain the 
interfaces which do not directly expose the underlying XIVE concepts. 
It's better to understand a little what is happening under the hood.

> Another general comment is that you seem to have written all this
> code assuming we are using HV KVM in a host running bare-metal.

Yes. I didn't look at the other configurations. I thought that we could
use the kernel_irqchip=off option to begin with. A couple of checks
are indeed missing.

> However, we could be using PR KVM (either in a bare-metal host or in a
> guest), or we could be doing nested HV KVM where we are using the
> kvm_hv module inside a KVM guest and using special hypercalls for
> controlling our guests.

Yes. 

It would be good to talk a little about the nested support (maybe 
offline) to make sure that we are not missing some major interface that 
would require a lot of change. If we need to prepare ground, I think
the timing is good.

The size of the IRQ number space might be a problem. It seems we 
would need to increase it considerably to support multiple nested 
guests. That said, I haven't looked much at how nested is designed.

> It would be perfectly acceptable for now to say that we don't yet
> support XIVE exploitation in those scenarios, as long as we then make
> sure that the new KVM capability reports false in those scenarios, and
> any attempt to use the XIVE exploitation interfaces fails cleanly.

ok. That looks like the best approach for now.

> I don't see that either of those is true in the patch set as it
> stands, so that is one area that needs to be fixed.
> 
> A third general comment is that the new KVM interfaces you have added
> need to be documented in the files under Documentation/virtual/kvm.

ok. 

Thanks,

C.
Benjamin Herrenschmidt Jan. 23, 2019, 9:35 p.m. UTC | #3
On Wed, 2019-01-23 at 20:07 +0100, Cédric Le Goater wrote:
>  Event Assignment Structure, a.k.a IVE (Interrupt Virtualization Entry)
> 
> All the names changed somewhere between XIVE v1 and XIVE v2. OPAL and
> Linux should be adjusted ...

All the names changed between the HW design and the "architecture"
document. The HW guys use the old names, the architecture the new
names, and Linux & OPAL mostly use the old ones because frankly the new
names suck big time.

> It would be good to talk a little about the nested support (offline 
> may be) to make sure that we are not missing some major interface that 
> would require a lot of change. If we need to prepare ground, I think
> the timing is good.
> 
> The size of the IRQ number space might be a problem. It seems we 
> would need to increase it considerably to support multiple nested 
> guests. That said I haven't look much how nested is designed.  

The size of the VP space is a bigger concern. Even today. We really
need qemu to tell the max #cpu to KVM so we can allocate less of them.

As for nesting, I suggest for the foreseeable future we stick to XICS
emulation in nested guests.

Cheers,
Ben.
Cédric Le Goater Jan. 26, 2019, 8:25 a.m. UTC | #4
Was there a crashing.org shutdown ? 

  Received: from gate.crashing.org (gate.crashing.org [63.228.1.57])
	by in5.mail.ovh.net (Postfix) with ESMTPS id 43mYnj0nrlz1N7KC
	for <clg@kaod.org>; Fri, 25 Jan 2019 22:38:00 +0000 (UTC)
  Received: from localhost (localhost.localdomain [127.0.0.1])
	by gate.crashing.org (8.14.1/8.14.1) with ESMTP id x0NLZf4K021092;
	Wed, 23 Jan 2019 15:35:43 -0600


On 1/23/19 10:35 PM, Benjamin Herrenschmidt wrote:
> On Wed, 2019-01-23 at 20:07 +0100, Cédric Le Goater wrote:
>>  Event Assignment Structure, a.k.a IVE (Interrupt Virtualization Entry)
>>
>> All the names changed somewhere between XIVE v1 and XIVE v2. OPAL and
>> Linux should be adjusted ...
> 
> All the names changed between the HW design and the "architecture"
> document. The HW guys use the old names, the architecture the new
> names, and Linux & OPAL mostly use the old ones because frankly the new
> names suck big time.

Well, it does not make XIVE any clearer ... I did prefer the v1 names
but there was some naming overlap in the concepts. 

>> It would be good to talk a little about the nested support (offline 
>> may be) to make sure that we are not missing some major interface that 
>> would require a lot of change. If we need to prepare ground, I think
>> the timing is good.
>>
>> The size of the IRQ number space might be a problem. It seems we 
>> would need to increase it considerably to support multiple nested 
>> guests. That said I haven't look much how nested is designed.  
> 
> The size of the VP space is a bigger concern. Even today. We really
> need qemu to tell the max #cpu to KVM so we can allocate less of them.

ah yes. we would also need to reduce the number of available priorities 
per CPU to have more EQ descriptors available if I recall well. 

> As for nesting, I suggest for the foreseeable future we stick to XICS
> emulation in nested guests.

ok. so no kernel_irqchip at all. hmm.  

I was wondering how possible it was to have L2 initialize the underlying 
OPAL structures in the L0 hypervisor. Maybe with a sort of proxy hcall 
which would perform the initialization in QEMU L1 on behalf of L2.

Cheers,
C.
Paul Mackerras Jan. 28, 2019, 5:51 a.m. UTC | #5
On Wed, Jan 23, 2019 at 08:07:33PM +0100, Cédric Le Goater wrote:
> On 1/22/19 5:46 AM, Paul Mackerras wrote:
> > On Mon, Jan 07, 2019 at 07:43:12PM +0100, Cédric Le Goater wrote:
> >> Hello,
> >>
> >> On the POWER9 processor, the XIVE interrupt controller can control
> >> interrupt sources using MMIO to trigger events, to EOI or to turn off
> >> the sources. Priority management and interrupt acknowledgment is also
> >> controlled by MMIO in the CPU presenter subengine.
> >>
> >> PowerNV/baremetal Linux runs natively under XIVE but sPAPR guests need
> >> special support from the hypervisor to do the same. This is called the
> >> XIVE native exploitation mode and today, it can be activated under the
> >> PowerPC Hypervisor, pHyp. However, Linux/KVM lacks XIVE native support
> >> and still offers the old interrupt mode interface using a
> >> XICS-over-XIVE glue which implements the XICS hcalls.
> >>
> >> The following series is proposal to add the same support under KVM.
> >>
> >> A new KVM device is introduced for the XIVE native exploitation
> >> mode. It reuses most of the XICS-over-XIVE glue implementation
> >> structures which are internal to KVM but has a completely different
> >> interface. A set of Hypervisor calls configures the sources and the
> >> event queues and from there, all control is done by the guest through
> >> MMIOs.
> >>
> >> These MMIO regions (ESB and TIMA) are exposed to guests in QEMU,
> >> similarly to VFIO, and the associated VMAs are populated dynamically
> >> with the appropriate pages using a fault handler. This is implemented
> >> with a couple of KVM device ioctls.
> >>
> >> On a POWER9 sPAPR machine, the Client Architecture Support (CAS)
> >> negotiation process determines whether the guest operates with a
> >> interrupt controller using the XICS legacy model, as found on POWER8,
> >> or in XIVE exploitation mode. Which means that the KVM interrupt
> >> device should be created at runtime, after the machine as started.
> >> This requires extra KVM support to create/destroy KVM devices. The
> >> last patches are an attempt to solve that problem.
> >>
> >> Migration has its own specific needs. The patchset provides the
> >> necessary routines to quiesce XIVE, to capture and restore the state
> >> of the different structures used by KVM, OPAL and HW. Extra OPAL
> >> support is required for these.
> > 
> > Thanks for the patchset.  It mostly looks good, but there are some
> > more things we need to consider, and I think a v2 will be needed.
> > 
> > One general comment I have is that there are a lot of acronyms in this
> > code and you mostly seem to assume that people will know what they all
> > mean.  It would make the code more readable if you provide the
> > expansion of the acronym on first use in a comment or whatever.  For
> > example, one of the patches in this series talks about the "EAS"
> 
>  Event Assignment Structure, a.k.a IVE (Interrupt Virtualization Entry)
> 
> All the names changed somewhere between XIVE v1 and XIVE v2. OPAL and
> Linux should be adjusted ...
> 
> > without ever expanding it in any comment or in the patch description,
> > and I have forgotten just at the moment what EAS stands for (I just
> > know that understanding the XIVE is not eas-y. :)
> ah ! yes. But we have great documentation :)
> 
> We pushed some high level description of XIVE in QEMU :
> 
>   https://git.qemu.org/?p=qemu.git;a=blob;f=include/hw/ppc/xive.h;h=ec23253ba448e25c621356b55a7777119a738f8e;hb=HEAD
> 
> I should do the same for Linux with a KVM section to explain the 
> interfaces which do not directly expose the underlying XIVE concepts. 
> It's better to understand a little what is happening under the hood.
> 
> > Another general comment is that you seem to have written all this
> > code assuming we are using HV KVM in a host running bare-metal.
> 
> Yes. I didn't look at the other configurations. I thought that we could
> use the kernel_irqchip=off option to begin with. A couple of checks
> are indeed missing.

Using kernel_irqchip=off would mean that we would not be able to use
the in-kernel XICS emulation, which would have a performance impact.

We need an explicit capability for XIVE exploitation that can be
enabled or disabled on the qemu command line, so that we can enforce a
uniform set of capabilities across all the hosts in a migration
domain.  And it's no good to say we have the capability when all
attempts to use it will fail.  Therefore the kernel needs to say that
it doesn't have the capability in a PR KVM guest or in a nested HV
guest.

> > However, we could be using PR KVM (either in a bare-metal host or in a
> > guest), or we could be doing nested HV KVM where we are using the
> > kvm_hv module inside a KVM guest and using special hypercalls for
> > controlling our guests.
> 
> Yes. 
> 
> It would be good to talk a little about the nested support (offline 
> may be) to make sure that we are not missing some major interface that 
> would require a lot of change. If we need to prepare ground, I think
> the timing is good.
> 
> The size of the IRQ number space might be a problem. It seems we 
> would need to increase it considerably to support multiple nested 
> guests. That said I haven't look much how nested is designed.  

The current design of nested HV is that the entire non-volatile state
of all the nested guests is encapsulated within the state and
resources of the L1 hypervisor.  That means that if the L1 hypervisor
gets migrated, all of its guests go across inside it and there is no
extra state that L0 needs to be aware of.  That would imply that the
VP number space for the nested guests would need to come from within
the VP number space for L1; but the amount of VP space we allocate to
each guest doesn't seem to be large enough for that to be practical.

> > It would be perfectly acceptable for now to say that we don't yet
> > support XIVE exploitation in those scenarios, as long as we then make
> > sure that the new KVM capability reports false in those scenarios, and
> > any attempt to use the XIVE exploitation interfaces fails cleanly.
> 
> ok. That looks the best approach for now.
> 
> > I don't see that either of those is true in the patch set as it
> > stands, so that is one area that needs to be fixed.
> > 
> > A third general comment is that the new KVM interfaces you have added
> > need to be documented in the files under Documentation/virtual/kvm.
> 
> ok. 
> 
> Thanks,
> 
> C. 
> 

Paul.
Cédric Le Goater Jan. 29, 2019, 1:51 p.m. UTC | #6
>>> Another general comment is that you seem to have written all this
>>> code assuming we are using HV KVM in a host running bare-metal.
>>
>> Yes. I didn't look at the other configurations. I thought that we could
>> use the kernel_irqchip=off option to begin with. A couple of checks
>> are indeed missing.
> 
> Using kernel_irqchip=off would mean that we would not be able to use
> the in-kernel XICS emulation, which would have a performance impact.

yes. But it is not supported today. Correct ? 

> We need an explicit capability for XIVE exploitation that can be
> enabled or disabled on the qemu command line, so that we can enforce a
> uniform set of capabilities across all the hosts in a migration
> domain.  And it's no good to say we have the capability when all
> attempts to use it will fail.  Therefore the kernel needs to say that
> it doesn't have the capability in a PR KVM guest or in a nested HV
> guest.

OK. I will work on adding a KVM_CAP_PPC_NESTED_IRQ_HV capability 
for future use.

>>> However, we could be using PR KVM (either in a bare-metal host or in a
>>> guest), or we could be doing nested HV KVM where we are using the
>>> kvm_hv module inside a KVM guest and using special hypercalls for
>>> controlling our guests.
>>
>> Yes. 
>>
>> It would be good to talk a little about the nested support (offline 
>> may be) to make sure that we are not missing some major interface that 
>> would require a lot of change. If we need to prepare ground, I think
>> the timing is good.
>>
>> The size of the IRQ number space might be a problem. It seems we 
>> would need to increase it considerably to support multiple nested 
>> guests. That said I haven't look much how nested is designed.  
> 
> The current design of nested HV is that the entire non-volatile state
> of all the nested guests is encapsulated within the state and
> resources of the L1 hypervisor.  That means that if the L1 hypervisor
> gets migrated, all of its guests go across inside it and there is no
> extra state that L0 needs to be aware of.  That would imply that the
> VP number space for the nested guests would need to come from within
> the VP number space for L1; but the amount of VP space we allocate to
> each guest doesn't seem to be large enough for that to be practical.

If the KVM XIVE device had some information on the max number of CPUs 
provisioned for the guest, we could optimize the VP allocation.

That might be a larger KVM topic though. There are some static limits 
on the number of CPUs in QEMU and in KVM, which have no relation AFAICT. 

Thanks,

C.
Paul Mackerras Jan. 30, 2019, 5:40 a.m. UTC | #7
On Tue, Jan 29, 2019 at 02:51:05PM +0100, Cédric Le Goater wrote:
> >>> Another general comment is that you seem to have written all this
> >>> code assuming we are using HV KVM in a host running bare-metal.
> >>
> >> Yes. I didn't look at the other configurations. I thought that we could
> >> use the kernel_irqchip=off option to begin with. A couple of checks
> >> are indeed missing.
> > 
> > Using kernel_irqchip=off would mean that we would not be able to use
> > the in-kernel XICS emulation, which would have a performance impact.
> 
> yes. But it is not supported today. Correct ? 

Not correct, it has been working for years, and works in v5.0-rc1 (I
just tested it), at both L0 and L1.

> > We need an explicit capability for XIVE exploitation that can be
> > enabled or disabled on the qemu command line, so that we can enforce a
> > uniform set of capabilities across all the hosts in a migration
> > domain.  And it's no good to say we have the capability when all
> > attempts to use it will fail.  Therefore the kernel needs to say that
> > it doesn't have the capability in a PR KVM guest or in a nested HV
> > guest.
> 
> OK. I will work on adding a KVM_CAP_PPC_NESTED_IRQ_HV capability 
> for future use.

That's not what I meant.  Why do we need that?  I meant that querying
the new KVM_CAP_PPC_IRQ_XIVE capability should return 0 if we are in a
guest.  It should only return 1 if we are running bare-metal on a P9.
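
As a rough sketch of the behaviour being asked for, the predicate could
look like the following. This is a standalone model, not kernel code:
struct kvm_host_env, xive_cap_value() and xive_device_create() are
hypothetical names, and the hv_kvm field reflects the earlier point in
this thread that PR KVM is not covered either.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical description of where KVM is running; not a kernel structure. */
struct kvm_host_env {
    bool is_power9;  /* the CPU is a POWER9 */
    bool bare_metal; /* false when the host kernel is itself a guest */
    bool hv_kvm;     /* true for HV KVM, false for PR KVM */
};

/*
 * Value a query of the XIVE capability would return:
 * 1 only for HV KVM running bare-metal on a P9, 0 everywhere else.
 */
int xive_cap_value(const struct kvm_host_env *env)
{
    return (env->is_power9 && env->bare_metal && env->hv_kvm) ? 1 : 0;
}

/*
 * Any attempt to use the XIVE exploitation interface has to fail
 * cleanly when the capability is not reported.
 */
int xive_device_create(const struct kvm_host_env *env)
{
    if (!xive_cap_value(env))
        return -ENODEV;
    return 0; /* the real device setup would happen here */
}
```

The point of keeping both checks on the same predicate is that the
capability never claims support in a configuration where creating the
device would then fail.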

> >>> However, we could be using PR KVM (either in a bare-metal host or in a
> >>> guest), or we could be doing nested HV KVM where we are using the
> >>> kvm_hv module inside a KVM guest and using special hypercalls for
> >>> controlling our guests.
> >>
> >> Yes. 
> >>
> >> It would be good to talk a little about the nested support (offline 
> >> may be) to make sure that we are not missing some major interface that 
> >> would require a lot of change. If we need to prepare ground, I think
> >> the timing is good.
> >>
> >> The size of the IRQ number space might be a problem. It seems we 
> >> would need to increase it considerably to support multiple nested 
> >> guests. That said I haven't look much how nested is designed.  
> > 
> > The current design of nested HV is that the entire non-volatile state
> > of all the nested guests is encapsulated within the state and
> > resources of the L1 hypervisor.  That means that if the L1 hypervisor
> > gets migrated, all of its guests go across inside it and there is no
> > extra state that L0 needs to be aware of.  That would imply that the
> > VP number space for the nested guests would need to come from within
> > the VP number space for L1; but the amount of VP space we allocate to
> > each guest doesn't seem to be large enough for that to be practical.
> 
> If the KVM XIVE device had some information on the max number of CPUs 
> provisioned for the guest, we could optimize the VP allocation.

The problem is that we might have 1000 guests running under L0, or we
might have 1 guest running under L0 and 1000 guests running under it,
and we have no way to know which situation to optimize for at the
point where an L1 guest starts.  If we had an enormous VP space then
we could just give each L1 guest a large amount of VP space and solve
it that way; but we don't.
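
A back-of-the-envelope sketch of the trade-off, with placeholder
numbers: the total of 524288 VPs and the block of 2048 VPs per guest
used below are assumptions picked for illustration, not actual hardware
or KVM limits, and the helper names are hypothetical.

```c
#include <stdint.h>

/* Number of guests that fit when each one is given a fixed block of VPs. */
uint64_t max_guests(uint64_t total_vps, uint64_t vps_per_guest)
{
    return vps_per_guest ? total_vps / vps_per_guest : 0;
}

/*
 * VPs an L1 guest needs when the VPs of its nested guests must come
 * out of the L1's own VP number space.
 */
uint64_t l1_vps_needed(uint64_t l1_vcpus, uint64_t nested_guests,
                       uint64_t vcpus_per_nested)
{
    return l1_vcpus + nested_guests * vcpus_per_nested;
}
```

With these placeholder values, max_guests(524288, 2048) gives 256 guests
under L0, while l1_vps_needed(8, 1000, 8) gives 8008 VPs for a single L1
hosting 1000 nested guests of 8 vCPUs each, which does not fit in a
2048 VP block. Sizing every block for that case instead would leave
room for only max_guests(524288, 8008) = 65 guests under L0.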

Paul.
Cédric Le Goater Jan. 30, 2019, 3:36 p.m. UTC | #8
On 1/30/19 6:40 AM, Paul Mackerras wrote:
> On Tue, Jan 29, 2019 at 02:51:05PM +0100, Cédric Le Goater wrote:
>>>>> Another general comment is that you seem to have written all this
>>>>> code assuming we are using HV KVM in a host running bare-metal.
>>>>
>>>> Yes. I didn't look at the other configurations. I thought that we could
>>>> use the kernel_irqchip=off option to begin with. A couple of checks
>>>> are indeed missing.
>>>
>>> Using kernel_irqchip=off would mean that we would not be able to use
>>> the in-kernel XICS emulation, which would have a performance impact.
>>
>> yes. But it is not supported today. Correct ? 
> 
> Not correct, it has been working for years, and works in v5.0-rc1 (I
> just tested it), at both L0 and L1.

Please see the other email for the test I did.

>>> We need an explicit capability for XIVE exploitation that can be
>>> enabled or disabled on the qemu command line, so that we can enforce a
>>> uniform set of capabilities across all the hosts in a migration
>>> domain.  And it's no good to say we have the capability when all
>>> attempts to use it will fail.  Therefore the kernel needs to say that
>>> it doesn't have the capability in a PR KVM guest or in a nested HV
>>> guest.
>>
>> OK. I will work on adding a KVM_CAP_PPC_NESTED_IRQ_HV capability 
>> for future use.
> 
> That's not what I meant.  Why do we need that?  I meant that querying
> the new KVM_CAP_PPC_IRQ_XIVE capability should return 0 if we are in a
> guest.  It should only return 1 if we are running bare-metal on a P9.

ok. I guess I need to understand first how the nested guest uses the 
KVM IRQ device. That is a question in another email thread.   

>>>>> However, we could be using PR KVM (either in a bare-metal host or in a
>>>>> guest), or we could be doing nested HV KVM where we are using the
>>>>> kvm_hv module inside a KVM guest and using special hypercalls for
>>>>> controlling our guests.
>>>>
>>>> Yes. 
>>>>
>>>> It would be good to talk a little about the nested support (maybe
>>>> offline) to make sure that we are not missing some major interface that
>>>> would require a lot of change. If we need to prepare ground, I think
>>>> the timing is good.
>>>>
>>>> The size of the IRQ number space might be a problem. It seems we 
>>>> would need to increase it considerably to support multiple nested 
>>>> guests. That said, I haven't looked much at how nested is designed.
>>>
>>> The current design of nested HV is that the entire non-volatile state
>>> of all the nested guests is encapsulated within the state and
>>> resources of the L1 hypervisor.  That means that if the L1 hypervisor
>>> gets migrated, all of its guests go across inside it and there is no
>>> extra state that L0 needs to be aware of.  That would imply that the
>>> VP number space for the nested guests would need to come from within
>>> the VP number space for L1; but the amount of VP space we allocate to
>>> each guest doesn't seem to be large enough for that to be practical.
>>
>> If the KVM XIVE device had some information on the max number of CPUs 
>> provisioned for the guest, we could optimize the VP allocation.
> 
> The problem is that we might have 1000 guests running under L0, or we
> might have 1 guest running under L0 and 1000 guests running under it,
> and we have no way to know which situation to optimize for at the
> point where an L1 guest starts.  If we had an enormous VP space then
> we could just give each L1 guest a large amount of VP space and solve
> it that way; but we don't.

There are some ideas to increase our VP space size. Using multiblock 
per XIVE chip in skiboot is one I think. It's not an obvious change. 
Also, XIVE2 will add more bits to the NVT index so we will be free 
to allocate more at once when P10 is available.

On the same topic, maybe we could move the VP allocator from skiboot
to KVM, allocate the full VP space at the KVM level and let KVM do
the VP segmentation.

Anyhow, I think that if we knew how many VPs we need to provision
when the KVM XIVE device is created, we would make better use of the
available space. Wouldn't we?
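
A rough sketch of the arithmetic behind that question, in C. It assumes VP blocks are handed out in power-of-two sizes; the helper names and the VP space size used in the example are made up for illustration and are not taken from skiboot or KVM.

```c
/*
 * Smallest order such that (1 << order) >= max_cpus.
 * Returns 0 for 0 or 1 CPUs. Hypothetical helper.
 */
static unsigned int vp_block_order(unsigned int max_cpus)
{
    unsigned int order = 0;

    while (order < 31 && (1u << order) < max_cpus)
        order++;
    return order;
}

/*
 * Number of guests, each provisioned for max_cpus VPs, that fit in a
 * VP space of (1 << space_order) entries. Assumes space_order <= 31.
 */
static unsigned long guests_that_fit(unsigned int space_order,
                                     unsigned int max_cpus)
{
    unsigned int order = vp_block_order(max_cpus);

    if (order > space_order)
        return 0;
    return 1ul << (space_order - order);
}
```

For a purely illustrative VP space of 2^19 entries, provisioning every guest for 2048 VPs leaves room for 256 guests, while provisioning for 16 VPs leaves room for 32768. That gap is the reason to tell the device the maximum number of CPUs up front.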

Thanks,

C.
David Gibson Feb. 4, 2019, 5:36 a.m. UTC | #9
On Sat, Jan 26, 2019 at 09:25:04AM +0100, Cédric Le Goater wrote:
> Was there a crashing.org shutdown ? 
> 
>   Received: from gate.crashing.org (gate.crashing.org [63.228.1.57])
> 	by in5.mail.ovh.net (Postfix) with ESMTPS id 43mYnj0nrlz1N7KC
> 	for <clg@kaod.org>; Fri, 25 Jan 2019 22:38:00 +0000 (UTC)
>   Received: from localhost (localhost.localdomain [127.0.0.1])
> 	by gate.crashing.org (8.14.1/8.14.1) with ESMTP id x0NLZf4K021092;
> 	Wed, 23 Jan 2019 15:35:43 -0600
> 
> 
> On 1/23/19 10:35 PM, Benjamin Herrenschmidt wrote:
> > On Wed, 2019-01-23 at 20:07 +0100, Cédric Le Goater wrote:
> >>  Event Assignment Structure, a.k.a IVE (Interrupt Virtualization Entry)
> >>
> >> All the names changed somewhere between XIVE v1 and XIVE v2. OPAL and
> >> Linux should be adjusted ...
> > 
> > All the names changed between the HW design and the "architecture"
> > document. The HW guys use the old names, the architecture the new
> > names, and Linux & OPAL mostly use the old ones because frankly the new
> > names suck big time.
> 
> Well, it does not make XIVE any clearer ... I did prefer the v1 names
> but there was some naming overlap in the concepts. 
> 
> >> It would be good to talk a little about the nested support (maybe
> >> offline) to make sure that we are not missing some major interface that
> >> would require a lot of change. If we need to prepare ground, I think
> >> the timing is good.
> >>
> >> The size of the IRQ number space might be a problem. It seems we 
> >> would need to increase it considerably to support multiple nested 
> >> guests. That said, I haven't looked much at how nested is designed.
> > 
> > The size of the VP space is a bigger concern. Even today. We really
> > need qemu to tell the max #cpu to KVM so we can allocate less of them.
> 
> ah yes. we would also need to reduce the number of available priorities 
> per CPU to have more EQ descriptors available if I recall well. 
> 
> > As for nesting, I suggest for the foreseeable future we stick to XICS
> > emulation in nested guests.
> 
> ok. so no kernel_irqchip at all. hmm.  

That would certainly be step 0, making sure the capability advertises
this correctly.  I think we do want to make XICS-on-XIVE emulation
work in a KVM L1 (so we'd need to have it make XIVE hcalls to the L0
instead of OPAL calls).

XIVE-on-XIVE for L1 would be nice too, which would mean implementing
the XIVE hcalls from the L2 in terms of XIVE hcalls to the L0.  I
think it's ok to delay this indefinitely as long as the caps advertise
correctly so that qemu will use userspace emulation until it's ready.

> I was wondering how possible it was to have L2 initialize the underlying 
> OPAL structures in the L0 hypervisor. Maybe with a sort of proxy hcall
> which would perform the initialization in QEMU L1 on behalf of L2.
>
Cédric Le Goater Feb. 5, 2019, 11:31 a.m. UTC | #10
>>> As for nesting, I suggest for the foreseeable future we stick to XICS
>>> emulation in nested guests.
>>
>> ok. so no kernel_irqchip at all. hmm. 

I was confused with what Paul calls 'XICS emulation'. It's not the QEMU
XICS emulated device but the XICS-over-XIVE KVM device, the KVM XICS 
device KVM uses when under a P9 processor. 

> That would certainly be step 0, making sure the capability advertises
> this correctly.  I think we do want to make XICS-on-XIVE emulation
> work in a KVM L1 (so we'd need to have it make XIVE hcalls to the L0
> instead of OPAL calls).

With Paul's latest patch, the KVM XICS device is available for L2
and it works quite well. 

I also want to test it when L1 runs in KVM XIVE native mode, with the 
current patchset, to see how it behaves.

> XIVE-on-XIVE for L1 would be nice too, which would mean implementing
> the XIVE hcalls from the L2 in terms of XIVE hcalls to the L0.  I
> think it's ok to delay this indefinitely as long as the caps advertise
> correctly so that qemu will use userspace emulation until it's ready.

ok. I need to fix this in the current patchset.

Thanks,

C.
Paul Mackerras Feb. 5, 2019, 10:13 p.m. UTC | #11
On Tue, Feb 05, 2019 at 12:31:28PM +0100, Cédric Le Goater wrote:
> >>> As for nesting, I suggest for the foreseeable future we stick to XICS
> >>> emulation in nested guests.
> >>
> >> ok. so no kernel_irqchip at all. hmm. 
> 
> I was confused with what Paul calls 'XICS emulation'. It's not the QEMU
> XICS emulated device but the XICS-over-XIVE KVM device, the KVM XICS 
> device KVM uses when under a P9 processor. 

Actually there are two separate implementations of XICS emulation in
KVM.  The first (older) one is almost entirely a software emulation
but does have some cases where it accesses an underlying XICS device
in order to make some things faster (IPIs and pass-through of a device
interrupt to a guest).  The other, newer one is the XICS-on-XIVE
emulation that Ben wrote, which uses the XIVE hardware pretty heavily.
My patch was about making the older code work when there is no
XICS available to the host.

Paul.
David Gibson Feb. 6, 2019, 1:18 a.m. UTC | #12
On Wed, Feb 06, 2019 at 09:13:15AM +1100, Paul Mackerras wrote:
> On Tue, Feb 05, 2019 at 12:31:28PM +0100, Cédric Le Goater wrote:
> > >>> As for nesting, I suggest for the foreseeable future we stick to XICS
> > >>> emulation in nested guests.
> > >>
> > >> ok. so no kernel_irqchip at all. hmm. 
> > 
> > I was confused with what Paul calls 'XICS emulation'. It's not the QEMU
> > XICS emulated device but the XICS-over-XIVE KVM device, the KVM XICS 
> > device KVM uses when under a P9 processor. 
> 
> Actually there are two separate implementations of XICS emulation in
> KVM.  The first (older) one is almost entirely a software emulation
> but does have some cases where it accesses an underlying XICS device
> in order to make some things faster (IPIs and pass-through of a device
> interrupt to a guest).  The other, newer one is the XICS-on-XIVE
> emulation that Ben wrote, which uses the XIVE hardware pretty heavily.
> My patch was about making the older code work when there is no
> XICS available to the host.

Ah, right.  To clarify my earlier statements in light of this:

 * We definitely want some sort of kernel-XICS available in a nested
   guest.  AIUI, this is now accomplished, so, Yay!

 * Implementing the L2 XICS in terms of L1's PAPR-XIVE would be a
   bonus, but it's a much lower priority.
Cédric Le Goater Feb. 6, 2019, 7:35 a.m. UTC | #13
On 2/6/19 2:18 AM, David Gibson wrote:
> On Wed, Feb 06, 2019 at 09:13:15AM +1100, Paul Mackerras wrote:
>> On Tue, Feb 05, 2019 at 12:31:28PM +0100, Cédric Le Goater wrote:
>>>>>> As for nesting, I suggest for the foreseeable future we stick to XICS
>>>>>> emulation in nested guests.
>>>>>
>>>>> ok. so no kernel_irqchip at all. hmm. 
>>>
>>> I was confused with what Paul calls 'XICS emulation'. It's not the QEMU
>>> XICS emulated device but the XICS-over-XIVE KVM device, the KVM XICS 
>>> device KVM uses when under a P9 processor. 
>>
>> Actually there are two separate implementations of XICS emulation in
>> KVM.  The first (older) one is almost entirely a software emulation
>> but does have some cases where it accesses an underlying XICS device
>> in order to make some things faster (IPIs and pass-through of a device
>> interrupt to a guest).  The other, newer one is the XICS-on-XIVE
>> emulation that Ben wrote, which uses the XIVE hardware pretty heavily.
>> My patch was about making the older code work when there is no
>> XICS available to the host.
> 
> Ah, right.  To clarify my earlier statements in light of this:
> 
>  * We definitely want some sort of kernel-XICS available in a nested
>    guest.  AIUI, this is now accomplished, so, Yay!
> 
>  * Implementing the L2 XICS in terms of L1's PAPR-XIVE would be a
>    bonus, but it's a much lower priority.

Yes. In this case, the L1 KVM-HV should not advertise KVM_CAP_PPC_IRQ_XIVE
to QEMU which will restrict CAS to the XICS only interrupt mode.

C.
David Gibson Feb. 7, 2019, 2:51 a.m. UTC | #14
On Wed, Feb 06, 2019 at 08:35:24AM +0100, Cédric Le Goater wrote:
> On 2/6/19 2:18 AM, David Gibson wrote:
> > On Wed, Feb 06, 2019 at 09:13:15AM +1100, Paul Mackerras wrote:
> >> On Tue, Feb 05, 2019 at 12:31:28PM +0100, Cédric Le Goater wrote:
> >>>>>> As for nesting, I suggest for the foreseeable future we stick to XICS
> >>>>>> emulation in nested guests.
> >>>>>
> >>>>> ok. so no kernel_irqchip at all. hmm. 
> >>>
> >>> I was confused with what Paul calls 'XICS emulation'. It's not the QEMU
> >>> XICS emulated device but the XICS-over-XIVE KVM device, the KVM XICS 
> >>> device KVM uses when under a P9 processor. 
> >>
> >> Actually there are two separate implementations of XICS emulation in
> >> KVM.  The first (older) one is almost entirely a software emulation
> >> but does have some cases where it accesses an underlying XICS device
> >> in order to make some things faster (IPIs and pass-through of a device
> >> interrupt to a guest).  The other, newer one is the XICS-on-XIVE
> >> emulation that Ben wrote, which uses the XIVE hardware pretty heavily.
> >> My patch was about making the older code work when there is no
> >> XICS available to the host.
> > 
> > Ah, right.  To clarify my earlier statements in light of this:
> > 
> >  * We definitely want some sort of kernel-XICS available in a nested
> >    guest.  AIUI, this is now accomplished, so, Yay!
> > 
> >  * Implementing the L2 XICS in terms of L1's PAPR-XIVE would be a
> >    bonus, but it's a much lower priority.
> 
> Yes. In this case, the L1 KVM-HV should not advertise KVM_CAP_PPC_IRQ_XIVE
> to QEMU which will restrict CAS to the XICS only interrupt mode.

Uh... no... we shouldn't change what's available to the guest based on
host configuration only.  We should just stop advertising the CAP
saying that *KVM implemented* is available so that qemu will fall back
to userspace XIVE emulation.
Cédric Le Goater Feb. 7, 2019, 8:31 a.m. UTC | #15
On 2/7/19 3:51 AM, David Gibson wrote:
> On Wed, Feb 06, 2019 at 08:35:24AM +0100, Cédric Le Goater wrote:
>> On 2/6/19 2:18 AM, David Gibson wrote:
>>> On Wed, Feb 06, 2019 at 09:13:15AM +1100, Paul Mackerras wrote:
>>>> On Tue, Feb 05, 2019 at 12:31:28PM +0100, Cédric Le Goater wrote:
>>>>>>>> As for nesting, I suggest for the foreseeable future we stick to XICS
>>>>>>>> emulation in nested guests.
>>>>>>>
>>>>>>> ok. so no kernel_irqchip at all. hmm. 
>>>>>
>>>>> I was confused with what Paul calls 'XICS emulation'. It's not the QEMU
>>>>> XICS emulated device but the XICS-over-XIVE KVM device, the KVM XICS 
>>>>> device KVM uses when under a P9 processor. 
>>>>
>>>> Actually there are two separate implementations of XICS emulation in
>>>> KVM.  The first (older) one is almost entirely a software emulation
>>>> but does have some cases where it accesses an underlying XICS device
>>>> in order to make some things faster (IPIs and pass-through of a device
>>>> interrupt to a guest).  The other, newer one is the XICS-on-XIVE
>>>> emulation that Ben wrote, which uses the XIVE hardware pretty heavily.
>>>> My patch was about making the older code work when there is no
>>>> XICS available to the host.
>>>
>>> Ah, right.  To clarify my earlier statements in light of this:
>>>
>>>  * We definitely want some sort of kernel-XICS available in a nested
>>>    guest.  AIUI, this is now accomplished, so, Yay!
>>>
>>>  * Implementing the L2 XICS in terms of L1's PAPR-XIVE would be a
>>>    bonus, but it's a much lower priority.
>>
>> Yes. In this case, the L1 KVM-HV should not advertise KVM_CAP_PPC_IRQ_XIVE
>> to QEMU which will restrict CAS to the XICS only interrupt mode.
> 
> Uh... no... we shouldn't change what's available to the guest based on
> host configuration only.  We should just stop advertising the CAP
> saying that *KVM implemented* is available 

yes. that is what I meant.

> so that qemu will fall back to userspace XIVE emulation.

even if kernel_irqchip is required ? 

Today, QEMU just fails to start. With the dual mode, the interrupt mode 
is negotiated at CAS time and when merged, the KVM device will be created 
at reset. In case of failure, QEMU will abort. 

I am not saying it is not possible, but we will need some internal
infrastructure to dynamically handle the fallback to userspace emulation.

C.
David Gibson Feb. 8, 2019, 5:07 a.m. UTC | #16
On Thu, Feb 07, 2019 at 09:31:06AM +0100, Cédric Le Goater wrote:
> On 2/7/19 3:51 AM, David Gibson wrote:
> > On Wed, Feb 06, 2019 at 08:35:24AM +0100, Cédric Le Goater wrote:
> >> On 2/6/19 2:18 AM, David Gibson wrote:
> >>> On Wed, Feb 06, 2019 at 09:13:15AM +1100, Paul Mackerras wrote:
> >>>> On Tue, Feb 05, 2019 at 12:31:28PM +0100, Cédric Le Goater wrote:
> >>>>>>>> As for nesting, I suggest for the foreseeable future we stick to XICS
> >>>>>>>> emulation in nested guests.
> >>>>>>>
> >>>>>>> ok. so no kernel_irqchip at all. hmm. 
> >>>>>
> >>>>> I was confused with what Paul calls 'XICS emulation'. It's not the QEMU
> >>>>> XICS emulated device but the XICS-over-XIVE KVM device, the KVM XICS 
> >>>>> device KVM uses when under a P9 processor. 
> >>>>
> >>>> Actually there are two separate implementations of XICS emulation in
> >>>> KVM.  The first (older) one is almost entirely a software emulation
> >>>> but does have some cases where it accesses an underlying XICS device
> >>>> in order to make some things faster (IPIs and pass-through of a device
> >>>> interrupt to a guest).  The other, newer one is the XICS-on-XIVE
> >>>> emulation that Ben wrote, which uses the XIVE hardware pretty heavily.
> >>>> My patch was about making the older code work when there is no
> >>>> XICS available to the host.
> >>>
> >>> Ah, right.  To clarify my earlier statements in light of this:
> >>>
> >>>  * We definitely want some sort of kernel-XICS available in a nested
> >>>    guest.  AIUI, this is now accomplished, so, Yay!
> >>>
> >>>  * Implementing the L2 XICS in terms of L1's PAPR-XIVE would be a
> >>>    bonus, but it's a much lower priority.
> >>
> >> Yes. In this case, the L1 KVM-HV should not advertise KVM_CAP_PPC_IRQ_XIVE
> >> to QEMU which will restrict CAS to the XICS only interrupt mode.
> > 
> > Uh... no... we shouldn't change what's available to the guest based on
> > host configuration only.  We should just stop advertising the CAP
> > saying that *KVM implemented* is available 
> 
> yes. that is what I meant.
> 
> > so that qemu will fall back to userspace XIVE emulation.
> 
> even if kernel_irqchip is required ? 

Well, no, but if we don't specify.

> Today, QEMU just fails to start.

If we specify kernel_irqchip=on but the kernel can't support that I
think that's the right thing to do.

> With the dual mode, the interrupt mode 
> is negotiated at CAS time and when merged, the KVM device will be created 
> at reset. In case of failure, QEMU will abort. 
> 
> I am not saying it is not possible but we will need some internal 
> infrastructure to handle dynamically the fall back to userspace
> emulation.

Uh.. we do?  I think in all cases we need to make the XICS vs. XIVE
decision first (i.e. what we present to the guest), then we should
decide how to implement it (userspace, KVM accelerated, impossible and
give up).
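
The ordering described above could be sketched as follows in C. This is only an illustration under stated assumptions: the enums and `pick_backend()` are hypothetical and are not QEMU code. The interrupt model presented to the guest is chosen first and passed in; only then is a backend picked, honouring the kernel_irqchip setting as discussed in this thread.

```c
#include <stdbool.h>

/* What the guest sees; decided first. */
enum irq_model   { MODEL_XICS, MODEL_XIVE };
/* How it is implemented; decided second. */
enum irq_backend { BACKEND_USERSPACE, BACKEND_KVM, BACKEND_NONE };
/* State of the kernel_irqchip machine option (hypothetical encoding). */
enum irqchip_opt { IRQCHIP_DEFAULT, IRQCHIP_ON, IRQCHIP_OFF };

static enum irq_backend pick_backend(enum irq_model model,
                                     enum irqchip_opt opt,
                                     bool kvm_has_xics,
                                     bool kvm_has_xive)
{
    /* Does KVM advertise an in-kernel device for the chosen model? */
    bool kvm_ok = (model == MODEL_XIVE) ? kvm_has_xive : kvm_has_xics;

    switch (opt) {
    case IRQCHIP_OFF:
        return BACKEND_USERSPACE;
    case IRQCHIP_ON:
        /* Explicitly required but unavailable: give up. */
        return kvm_ok ? BACKEND_KVM : BACKEND_NONE;
    default:
        /* Not specified: use KVM if possible, else fall back. */
        return kvm_ok ? BACKEND_KVM : BACKEND_USERSPACE;
    }
}
```

The model never changes as a result of the backend choice, which is the point of deciding in this order: a host without the in-kernel XIVE device still presents XIVE to the guest, just emulated in userspace.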
Cédric Le Goater Feb. 8, 2019, 7:38 a.m. UTC | #17
>> With the dual mode, the interrupt mode 
>> is negotiated at CAS time and when merged, the KVM device will be created 
>> at reset. In case of failure, QEMU will abort. 
>>
>> I am not saying it is not possible but we will need some internal 
>> infrastructure to handle dynamically the fall back to userspace
>> emulation.
> 
> Uh.. we do?  I think in all cases we need to make the XICS vs. XIVE
> decision first (i.e. what we present to the guest), then we should
> decide how to implement it (userspace, KVM accelerated, impossible and
> give up).

I am changing things with the addition of KVM support for dual mode, but
that might not be the right approach. Let's talk it over when you reach
the end of the QEMU patchset.

I will keep in mind that we should know exactly what KVM supports
before the machine starts. That is, QEMU should not abort if we cannot
satisfy the interrupt mode chosen at CAS time. It might be possible
to fall back to XIVE emulated mode; I think that is where the problem
is, but I haven't looked at it closely.

C.