diff mbox series

[05/19] KVM: PPC: Book3S HV: add a new KVM device for the XIVE native exploitation mode

Message ID 20190107184331.8429-6-clg@kaod.org
State Changes Requested
Headers show
Series KVM: PPC: Book3S HV: add XIVE native exploitation mode | expand

Commit Message

Cédric Le Goater Jan. 7, 2019, 6:43 p.m. UTC
This is the basic framework for the new KVM device supporting the XIVE
native exploitation mode. The user interface exposes a new capability
and a new KVM device to be used by QEMU.

Internally, the interface to the new KVM device is protected with a
new interrupt mode: KVMPPC_IRQ_XIVE.

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---
 arch/powerpc/include/asm/kvm_host.h   |   2 +
 arch/powerpc/include/asm/kvm_ppc.h    |  21 ++
 arch/powerpc/kvm/book3s_xive.h        |   3 +
 include/uapi/linux/kvm.h              |   3 +
 arch/powerpc/kvm/book3s.c             |   7 +-
 arch/powerpc/kvm/book3s_xive_native.c | 332 ++++++++++++++++++++++++++
 arch/powerpc/kvm/powerpc.c            |  30 +++
 arch/powerpc/kvm/Makefile             |   2 +-
 8 files changed, 398 insertions(+), 2 deletions(-)
 create mode 100644 arch/powerpc/kvm/book3s_xive_native.c

Comments

Paul Mackerras Jan. 22, 2019, 5:05 a.m. UTC | #1
On Mon, Jan 07, 2019 at 07:43:17PM +0100, Cédric Le Goater wrote:
> This is the basic framework for the new KVM device supporting the XIVE
> native exploitation mode. The user interface exposes a new capability
> and a new KVM device to be used by QEMU.

[snip]
> @@ -1039,7 +1039,10 @@ static int kvmppc_book3s_init(void)
>  #ifdef CONFIG_KVM_XIVE
>  	if (xive_enabled()) {
>  		kvmppc_xive_init_module();
> +		kvmppc_xive_native_init_module();
>  		kvm_register_device_ops(&kvm_xive_ops, KVM_DEV_TYPE_XICS);
> +		kvm_register_device_ops(&kvm_xive_native_ops,
> +					KVM_DEV_TYPE_XIVE);

I think we want tighter conditions on initializing the xive_native
stuff and creating the xive device class.  We could have
xive_enabled() returning true in a guest, and this code will get
called both by PR KVM and HV KVM (and HV KVM no longer implies that we
are running bare metal).

> @@ -1050,8 +1053,10 @@ static int kvmppc_book3s_init(void)
>  static void kvmppc_book3s_exit(void)
>  {
>  #ifdef CONFIG_KVM_XICS
> -	if (xive_enabled())
> +	if (xive_enabled()) {
>  		kvmppc_xive_exit_module();
> +		kvmppc_xive_native_exit_module();

Same comment here.

Paul.
Cédric Le Goater Jan. 23, 2019, 4:28 p.m. UTC | #2
On 1/22/19 6:05 AM, Paul Mackerras wrote:
> On Mon, Jan 07, 2019 at 07:43:17PM +0100, Cédric Le Goater wrote:
>> This is the basic framework for the new KVM device supporting the XIVE
>> native exploitation mode. The user interface exposes a new capability
>> and a new KVM device to be used by QEMU.
> 
> [snip]
>> @@ -1039,7 +1039,10 @@ static int kvmppc_book3s_init(void)
>>  #ifdef CONFIG_KVM_XIVE
>>  	if (xive_enabled()) {
>>  		kvmppc_xive_init_module();
>> +		kvmppc_xive_native_init_module();
>>  		kvm_register_device_ops(&kvm_xive_ops, KVM_DEV_TYPE_XICS);
>> +		kvm_register_device_ops(&kvm_xive_native_ops,
>> +					KVM_DEV_TYPE_XIVE);
> 
> I think we want tighter conditions on initializing the xive_native
> stuff and creating the xive device class.  We could have
> xive_enabled() returning true in a guest, and this code will get
> called both by PR KVM and HV KVM (and HV KVM no longer implies that we
> are running bare metal).

Ah yes, I agree. I haven't addressed at all the nested flavor. I have 
some questions about this that I will ask in summary email you sent. 

Thanks,

C. 
  

> 
>> @@ -1050,8 +1053,10 @@ static int kvmppc_book3s_init(void)
>>  static void kvmppc_book3s_exit(void)
>>  {
>>  #ifdef CONFIG_KVM_XICS
>> -	if (xive_enabled())
>> +	if (xive_enabled()) {
>>  		kvmppc_xive_exit_module();
>> +		kvmppc_xive_native_exit_module();
> 
> Same comment here.
> 
> Paul.
>
Cédric Le Goater Jan. 28, 2019, 5:35 p.m. UTC | #3
On 1/22/19 6:05 AM, Paul Mackerras wrote:
> On Mon, Jan 07, 2019 at 07:43:17PM +0100, Cédric Le Goater wrote:
>> This is the basic framework for the new KVM device supporting the XIVE
>> native exploitation mode. The user interface exposes a new capability
>> and a new KVM device to be used by QEMU.
> 
> [snip]
>> @@ -1039,7 +1039,10 @@ static int kvmppc_book3s_init(void)
>>  #ifdef CONFIG_KVM_XIVE
>>  	if (xive_enabled()) {
>>  		kvmppc_xive_init_module();
>> +		kvmppc_xive_native_init_module();
>>  		kvm_register_device_ops(&kvm_xive_ops, KVM_DEV_TYPE_XICS);
>> +		kvm_register_device_ops(&kvm_xive_native_ops,
>> +					KVM_DEV_TYPE_XIVE);
> 
> I think we want tighter conditions on initializing the xive_native
> stuff and creating the xive device class.  We could have
> xive_enabled() returning true in a guest, and this code will get
> called both by PR KVM and HV KVM (and HV KVM no longer implies that we
> are running bare metal).

So yes, I gave nested a try with kernel_irqchip=on and the nested hypervisor 
(L1) obviously crashes trying to call OPAL. I have tighten the test with : 

	if (xive_enabled() && !kvmhv_on_pseries()) {

for now.

As this is a problem today in 5.0.x, I will send a patch for it if you think
it is correct. I don't think we should bother taking care of the PR case
on P9. Should we ? 

Thanks,

C.
 
>> @@ -1050,8 +1053,10 @@ static int kvmppc_book3s_init(void)
>>  static void kvmppc_book3s_exit(void)
>>  {
>>  #ifdef CONFIG_KVM_XICS
>> -	if (xive_enabled())
>> +	if (xive_enabled()) {
>>  		kvmppc_xive_exit_module();
>> +		kvmppc_xive_native_exit_module();
> 
> Same comment here.
> 
> Paul.
>
Paul Mackerras Jan. 30, 2019, 4:29 a.m. UTC | #4
On Mon, Jan 28, 2019 at 06:35:34PM +0100, Cédric Le Goater wrote:
> On 1/22/19 6:05 AM, Paul Mackerras wrote:
> > On Mon, Jan 07, 2019 at 07:43:17PM +0100, Cédric Le Goater wrote:
> >> This is the basic framework for the new KVM device supporting the XIVE
> >> native exploitation mode. The user interface exposes a new capability
> >> and a new KVM device to be used by QEMU.
> > 
> > [snip]
> >> @@ -1039,7 +1039,10 @@ static int kvmppc_book3s_init(void)
> >>  #ifdef CONFIG_KVM_XIVE
> >>  	if (xive_enabled()) {
> >>  		kvmppc_xive_init_module();
> >> +		kvmppc_xive_native_init_module();
> >>  		kvm_register_device_ops(&kvm_xive_ops, KVM_DEV_TYPE_XICS);
> >> +		kvm_register_device_ops(&kvm_xive_native_ops,
> >> +					KVM_DEV_TYPE_XIVE);
> > 
> > I think we want tighter conditions on initializing the xive_native
> > stuff and creating the xive device class.  We could have
> > xive_enabled() returning true in a guest, and this code will get
> > called both by PR KVM and HV KVM (and HV KVM no longer implies that we
> > are running bare metal).
> 
> So yes, I gave nested a try with kernel_irqchip=on and the nested hypervisor 
> (L1) obviously crashes trying to call OPAL. I have tighten the test with : 
> 
> 	if (xive_enabled() && !kvmhv_on_pseries()) {
> 
> for now.
> 
> As this is a problem today in 5.0.x, I will send a patch for it if you think

How do you mean this is a problem today in 5.0?  I just tried 5.0-rc1
with kernel_irqchip=on in a nested guest and it works just fine.  What
exactly did you test?

> it is correct. I don't think we should bother taking care of the PR case
> on P9. Should we ? 

We do need to take care of PR KVM on P9, since it is the only form of
nested KVM that works inside a host in HPT mode.

Paul.
Cédric Le Goater Jan. 30, 2019, 7:01 a.m. UTC | #5
On 1/30/19 5:29 AM, Paul Mackerras wrote:
> On Mon, Jan 28, 2019 at 06:35:34PM +0100, Cédric Le Goater wrote:
>> On 1/22/19 6:05 AM, Paul Mackerras wrote:
>>> On Mon, Jan 07, 2019 at 07:43:17PM +0100, Cédric Le Goater wrote:
>>>> This is the basic framework for the new KVM device supporting the XIVE
>>>> native exploitation mode. The user interface exposes a new capability
>>>> and a new KVM device to be used by QEMU.
>>>
>>> [snip]
>>>> @@ -1039,7 +1039,10 @@ static int kvmppc_book3s_init(void)
>>>>  #ifdef CONFIG_KVM_XIVE
>>>>  	if (xive_enabled()) {
>>>>  		kvmppc_xive_init_module();
>>>> +		kvmppc_xive_native_init_module();
>>>>  		kvm_register_device_ops(&kvm_xive_ops, KVM_DEV_TYPE_XICS);
>>>> +		kvm_register_device_ops(&kvm_xive_native_ops,
>>>> +					KVM_DEV_TYPE_XIVE);
>>>
>>> I think we want tighter conditions on initializing the xive_native
>>> stuff and creating the xive device class.  We could have
>>> xive_enabled() returning true in a guest, and this code will get
>>> called both by PR KVM and HV KVM (and HV KVM no longer implies that we
>>> are running bare metal).
>>
>> So yes, I gave nested a try with kernel_irqchip=on and the nested hypervisor 
>> (L1) obviously crashes trying to call OPAL. I have tighten the test with : 
>>
>> 	if (xive_enabled() && !kvmhv_on_pseries()) {
>>
>> for now.
>>
>> As this is a problem today in 5.0.x, I will send a patch for it if you think
> 
> How do you mean this is a problem today in 5.0?  I just tried 5.0-rc1
> with kernel_irqchip=on in a nested guest and it works just fine.  What
> exactly did you test?

L0: Linux 5.0.0-rc3 (+ KVM HV)
L1:     QEMU pseries-4.0 (kernel_irqchip=on) - Linux 5.0.0-rc3 (+ KVM HV)
L2:          QEMU pseries-4.0 (kernel_irqchip=on) - Linux 5.0.0-rc3

L1 crashes when L2 starts and tries to initialize the KVM IRQ device as 
it does an OPAL call and its running under SLOF. See below.

I don't understand how L2 can work with kernel_irqchip=on. Could you
please explain ? 

>> it is correct. I don't think we should bother taking care of the PR case
>> on P9. Should we ? 
> 
> We do need to take care of PR KVM on P9, since it is the only form of
> nested KVM that works inside a host in HPT mode.

ok. That is the test case. There are quite a few combinations now.

Thanks,

C.

[   49.547056] Oops: Exception in kernel mode, sig: 4 [#1]
[   49.555101] LE SMP NR_CPUS=2048 NUMA pSeries
[   49.555132] Modules linked in: xt_CHECKSUM iptable_mangle ipt_MASQUERADE iptable_nat nf_nat_ipv4 nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6 libcrc32c nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 xt_tcpudp bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter vmx_crypto crct10dif_vpmsum crc32c_vpmsum kvm_hv kvm sch_fq_codel ip_tables x_tables autofs4 virtio_net net_failover failover virtio_scsi
[   49.555335] CPU: 9 PID: 2162 Comm: qemu-system-ppc Kdump: loaded Not tainted 5.0.0-rc3+ #53
[   49.555378] NIP:  c0000000000a7548 LR: c0000000000a4044 CTR: c0000000000a24b0
[   49.555421] REGS: c0000003ad71f8a0 TRAP: 0700   Not tainted  (5.0.0-rc3+)
[   49.555456] MSR:  8000000000041033 <SF,ME,IR,DR,RI,LE>  CR: 44222822  XER: 20040000
[   49.555501] CFAR: c0000000000a2508 IRQMASK: 0 
[   49.555501] GPR00: 0000000000000087 c0000003ad71fb30 c00000000175f700 000000000000000b 
[   49.555501] GPR04: 0000000000000000 0000000000000000 c0000003f88d4000 000000000000000b 
[   49.555501] GPR08: 00000003fd800000 000000000000000b 0000000000000800 0000000000000031 
[   49.555501] GPR12: 8000000000001002 c000000007ff3280 0000000000000000 0000000000000000 
[   49.555501] GPR16: 00007ffff8d2bd60 0000000000000000 000002c9896d7800 00007ffff8d2b970 
[   49.555501] GPR20: 000002c95c876f90 000002c95c876fa0 000002c95c876f80 000002c95c876f70 
[   49.555501] GPR24: 000002c95cf4f648 ffffffffffffffff c0000003ab3e4058 00000000006000c0 
[   49.555501] GPR28: 000000000000000b c0000003ab3e0000 0000000000000000 c0000003f88d0000 
[   49.555883] NIP [c0000000000a7548] opal_xive_alloc_vp_block+0x50/0x68
[   49.555919] LR [c0000000000a4044] opal_return+0x0/0x48
[   49.555947] Call Trace:
[   49.555964] [c0000003ad71fb30] [c0000000000a250c] xive_native_alloc_vp_block+0x5c/0x1c0 (unreliable)
[   49.556019] [c0000003ad71fbc0] [c00800000430c0c0] kvmppc_xive_create+0x98/0x168 [kvm]
[   49.556065] [c0000003ad71fc00] [c0080000042f9fcc] kvm_vm_ioctl+0x474/0xa00 [kvm]
[   49.556113] [c0000003ad71fd10] [c000000000423a64] do_vfs_ioctl+0xd4/0x8e0
[   49.556153] [c0000003ad71fdb0] [c000000000424334] ksys_ioctl+0xc4/0x110
[   49.556190] [c0000003ad71fe00] [c0000000004243a8] sys_ioctl+0x28/0x80
[   49.556230] [c0000003ad71fe20] [c00000000000b288] system_call+0x5c/0x70
[   49.556265] Instruction dump:
[   49.556288] 60000000 7d600026 91610008 39600000 616b8000 f98d0980 7d8c5878 7d810164 
[   49.556332] e9628098 7d6803a6 39600031 7d8c5878 <7d9b4ba6> e96280b0 e98b0008 e84b0000 
[   49.556378] ---[ end trace ac7420a6784de93b ]---
Paul Mackerras Jan. 31, 2019, 3:01 a.m. UTC | #6
On Wed, Jan 30, 2019 at 08:01:22AM +0100, Cédric Le Goater wrote:
> On 1/30/19 5:29 AM, Paul Mackerras wrote:
> > On Mon, Jan 28, 2019 at 06:35:34PM +0100, Cédric Le Goater wrote:
> >> On 1/22/19 6:05 AM, Paul Mackerras wrote:
> >>> On Mon, Jan 07, 2019 at 07:43:17PM +0100, Cédric Le Goater wrote:
> >>>> This is the basic framework for the new KVM device supporting the XIVE
> >>>> native exploitation mode. The user interface exposes a new capability
> >>>> and a new KVM device to be used by QEMU.
> >>>
> >>> [snip]
> >>>> @@ -1039,7 +1039,10 @@ static int kvmppc_book3s_init(void)
> >>>>  #ifdef CONFIG_KVM_XIVE
> >>>>  	if (xive_enabled()) {
> >>>>  		kvmppc_xive_init_module();
> >>>> +		kvmppc_xive_native_init_module();
> >>>>  		kvm_register_device_ops(&kvm_xive_ops, KVM_DEV_TYPE_XICS);
> >>>> +		kvm_register_device_ops(&kvm_xive_native_ops,
> >>>> +					KVM_DEV_TYPE_XIVE);
> >>>
> >>> I think we want tighter conditions on initializing the xive_native
> >>> stuff and creating the xive device class.  We could have
> >>> xive_enabled() returning true in a guest, and this code will get
> >>> called both by PR KVM and HV KVM (and HV KVM no longer implies that we
> >>> are running bare metal).
> >>
> >> So yes, I gave nested a try with kernel_irqchip=on and the nested hypervisor 
> >> (L1) obviously crashes trying to call OPAL. I have tighten the test with : 
> >>
> >> 	if (xive_enabled() && !kvmhv_on_pseries()) {
> >>
> >> for now.
> >>
> >> As this is a problem today in 5.0.x, I will send a patch for it if you think
> > 
> > How do you mean this is a problem today in 5.0?  I just tried 5.0-rc1
> > with kernel_irqchip=on in a nested guest and it works just fine.  What
> > exactly did you test?
> 
> L0: Linux 5.0.0-rc3 (+ KVM HV)
> L1:     QEMU pseries-4.0 (kernel_irqchip=on) - Linux 5.0.0-rc3 (+ KVM HV)
> L2:          QEMU pseries-4.0 (kernel_irqchip=on) - Linux 5.0.0-rc3
> 
> L1 crashes when L2 starts and tries to initialize the KVM IRQ device as 
> it does an OPAL call and its running under SLOF. See below.

OK, you must have a QEMU that advertises XIVE to the guest (L1).  In
that case I can see that L1 would try to do XICS-on-XIVE, which won't
work.  We need to fix that.  Unfortunately the XICS-on-XICS emulation
won't work as is in L1 either, but I think we can fix that by
disabling the real-mode XICS hcall handling.

> I don't understand how L2 can work with kernel_irqchip=on. Could you
> please explain ? 

If QEMU decides to advertise XIVE to the L2 guest and the L2 guest can
do XIVE, then the only possibility is to use the XIVE software
emulation in QEMU, and if kernel_irqchip=on has been specified
explicitly, maybe QEMU decides to terminate the guest rather than
implicitly turning off kernel_irqchip.

If QEMU decides not to advertise XIVE to the L2 guest, or the L2 guest
can't do XIVE, then we could use the XICS-on-XICS emulation in L1 as
long as either (a) L1 is not using XIVE, or (b) we modify the
XICS-on-XICS code to avoid using any XICS or XIVE access (i.e. just
using calls to generic kernel facilities).

Ultimately, if the spapr xive backend code in the kernel could be
extended to provide all the low-level functions that the XICS-on-XIVE
code needs, then we could do XICS-on-XIVE in a guest.

Paul.
Cédric Le Goater Feb. 1, 2019, 5:03 p.m. UTC | #7
On 1/31/19 4:01 AM, Paul Mackerras wrote:
> On Wed, Jan 30, 2019 at 08:01:22AM +0100, Cédric Le Goater wrote:
>> On 1/30/19 5:29 AM, Paul Mackerras wrote:
>>> On Mon, Jan 28, 2019 at 06:35:34PM +0100, Cédric Le Goater wrote:
>>>> On 1/22/19 6:05 AM, Paul Mackerras wrote:
>>>>> On Mon, Jan 07, 2019 at 07:43:17PM +0100, Cédric Le Goater wrote:
>>>>>> This is the basic framework for the new KVM device supporting the XIVE
>>>>>> native exploitation mode. The user interface exposes a new capability
>>>>>> and a new KVM device to be used by QEMU.
>>>>>
>>>>> [snip]
>>>>>> @@ -1039,7 +1039,10 @@ static int kvmppc_book3s_init(void)
>>>>>>  #ifdef CONFIG_KVM_XIVE
>>>>>>  	if (xive_enabled()) {
>>>>>>  		kvmppc_xive_init_module();
>>>>>> +		kvmppc_xive_native_init_module();
>>>>>>  		kvm_register_device_ops(&kvm_xive_ops, KVM_DEV_TYPE_XICS);
>>>>>> +		kvm_register_device_ops(&kvm_xive_native_ops,
>>>>>> +					KVM_DEV_TYPE_XIVE);
>>>>>
>>>>> I think we want tighter conditions on initializing the xive_native
>>>>> stuff and creating the xive device class.  We could have
>>>>> xive_enabled() returning true in a guest, and this code will get
>>>>> called both by PR KVM and HV KVM (and HV KVM no longer implies that we
>>>>> are running bare metal).
>>>>
>>>> So yes, I gave nested a try with kernel_irqchip=on and the nested hypervisor 
>>>> (L1) obviously crashes trying to call OPAL. I have tighten the test with : 
>>>>
>>>> 	if (xive_enabled() && !kvmhv_on_pseries()) {
>>>>
>>>> for now.
>>>>
>>>> As this is a problem today in 5.0.x, I will send a patch for it if you think
>>>
>>> How do you mean this is a problem today in 5.0?  I just tried 5.0-rc1
>>> with kernel_irqchip=on in a nested guest and it works just fine.  What
>>> exactly did you test?
>>
>> L0: Linux 5.0.0-rc3 (+ KVM HV)
>> L1:     QEMU pseries-4.0 (kernel_irqchip=on) - Linux 5.0.0-rc3 (+ KVM HV)
>> L2:          QEMU pseries-4.0 (kernel_irqchip=on) - Linux 5.0.0-rc3
>>
>> L1 crashes when L2 starts and tries to initialize the KVM IRQ device as 
>> it does an OPAL call and its running under SLOF. See below.
> 
> OK, you must have a QEMU that advertises XIVE to the guest (L1). 

XIVE is not advertised if QEMU is started with 'ic-mode=xics' 

> In
> that case I can see that L1 would try to do XICS-on-XIVE, which won't
> work.  We need to fix that.  Unfortunately the XICS-on-XICS emulation
> won't work as is in L1 either, but I think we can fix that by
> disabling the real-mode XICS hcall handling.

I have added some tests on kvm-hv, using kvmhv_on_pseries(), to disable 
the KVM XICS-on-XIVE device in a L1 guest running as hypervisor and 
to instead register the old KVM XICS device. 

If the L1 is started in KVM XICS mode, L2 can now run with KVM XICS.
All seem fine. I booted two guests with disk and network. 

But I am still "a bit" confused with what is being done at each 
hypervisor level. It's not obvious to follow at all even with traces.
 
>> I don't understand how L2 can work with kernel_irqchip=on. Could you
>> please explain ? 
> 
> If QEMU decides to advertise XIVE to the L2 guest and the L2 guest can
> do XIVE, then the only possibility is to use the XIVE software
> emulation in QEMU, and if kernel_irqchip=on has been specified
> explicitly, maybe QEMU decides to terminate the guest rather than
> implicitly turning off kernel_irqchip.

we can do that by disabling the KVM XIVE device when under kvmhv_on_pseries().

> If QEMU decides not to advertise XIVE to the L2 guest, or the L2 guest
> can't do XIVE, then we could use the XICS-on-XICS emulation in L1 as
> long as either (a) L1 is not using XIVE, or (b) we modify the
> XICS-on-XICS code to avoid using any XICS or XIVE access (i.e. just
> using calls to generic kernel facilities).

(a) is what I did above I think

May be we should consider having nested version of the KVM devices 
when under kvmhv_on_pseries(). With some sort of backend ops to
modify the relation with the parent hypervisor : PowerNV/Linux or 
pseries/Linux. 

> Ultimately, if the spapr xive backend code in the kernel could be
> extended to provide all the low-level functions that the XICS-on-XIVE
> code needs, then we could do XICS-on-XIVE in a guest.

What about a XIVE on XIVE ? 

Propagating the ESB pages to a nested guest seems feasible if not 
already done. The hcalls could be forwarded to the L1 QEMU ? The 
problematic part is handling the XIVE VP block.

C.
David Gibson Feb. 4, 2019, 4:25 a.m. UTC | #8
On Mon, Jan 07, 2019 at 07:43:17PM +0100, Cédric Le Goater wrote:
> This is the basic framework for the new KVM device supporting the XIVE
> native exploitation mode. The user interface exposes a new capability
> and a new KVM device to be used by QEMU.
> 
> Internally, the interface to the new KVM device is protected with a
> new interrupt mode: KVMPPC_IRQ_XIVE.
> 
> Signed-off-by: Cédric Le Goater <clg@kaod.org>
> ---
>  arch/powerpc/include/asm/kvm_host.h   |   2 +
>  arch/powerpc/include/asm/kvm_ppc.h    |  21 ++
>  arch/powerpc/kvm/book3s_xive.h        |   3 +
>  include/uapi/linux/kvm.h              |   3 +
>  arch/powerpc/kvm/book3s.c             |   7 +-
>  arch/powerpc/kvm/book3s_xive_native.c | 332 ++++++++++++++++++++++++++
>  arch/powerpc/kvm/powerpc.c            |  30 +++
>  arch/powerpc/kvm/Makefile             |   2 +-
>  8 files changed, 398 insertions(+), 2 deletions(-)
>  create mode 100644 arch/powerpc/kvm/book3s_xive_native.c
> 
> diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
> index 0f98f00da2ea..c522e8274ad9 100644
> --- a/arch/powerpc/include/asm/kvm_host.h
> +++ b/arch/powerpc/include/asm/kvm_host.h
> @@ -220,6 +220,7 @@ extern struct kvm_device_ops kvm_xics_ops;
>  struct kvmppc_xive;
>  struct kvmppc_xive_vcpu;
>  extern struct kvm_device_ops kvm_xive_ops;
> +extern struct kvm_device_ops kvm_xive_native_ops;
>  
>  struct kvmppc_passthru_irqmap;
>  
> @@ -446,6 +447,7 @@ struct kvmppc_passthru_irqmap {
>  #define KVMPPC_IRQ_DEFAULT	0
>  #define KVMPPC_IRQ_MPIC		1
>  #define KVMPPC_IRQ_XICS		2 /* Includes a XIVE option */
> +#define KVMPPC_IRQ_XIVE		3 /* XIVE native exploitation mode */
>  
>  #define MMIO_HPTE_CACHE_SIZE	4
>  
> diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
> index eb0d79f0ca45..1bb313f238fe 100644
> --- a/arch/powerpc/include/asm/kvm_ppc.h
> +++ b/arch/powerpc/include/asm/kvm_ppc.h
> @@ -591,6 +591,18 @@ extern int kvmppc_xive_set_icp(struct kvm_vcpu *vcpu, u64 icpval);
>  extern int kvmppc_xive_set_irq(struct kvm *kvm, int irq_source_id, u32 irq,
>  			       int level, bool line_status);
>  extern void kvmppc_xive_push_vcpu(struct kvm_vcpu *vcpu);
> +
> +static inline int kvmppc_xive_enabled(struct kvm_vcpu *vcpu)
> +{
> +	return vcpu->arch.irq_type == KVMPPC_IRQ_XIVE;
> +}
> +
> +extern int kvmppc_xive_native_connect_vcpu(struct kvm_device *dev,
> +				    struct kvm_vcpu *vcpu, u32 cpu);
> +extern void kvmppc_xive_native_cleanup_vcpu(struct kvm_vcpu *vcpu);
> +extern void kvmppc_xive_native_init_module(void);
> +extern void kvmppc_xive_native_exit_module(void);
> +
>  #else
>  static inline int kvmppc_xive_set_xive(struct kvm *kvm, u32 irq, u32 server,
>  				       u32 priority) { return -1; }
> @@ -614,6 +626,15 @@ static inline int kvmppc_xive_set_icp(struct kvm_vcpu *vcpu, u64 icpval) { retur
>  static inline int kvmppc_xive_set_irq(struct kvm *kvm, int irq_source_id, u32 irq,
>  				      int level, bool line_status) { return -ENODEV; }
>  static inline void kvmppc_xive_push_vcpu(struct kvm_vcpu *vcpu) { }
> +
> +static inline int kvmppc_xive_enabled(struct kvm_vcpu *vcpu)
> +	{ return 0; }
> +static inline int kvmppc_xive_native_connect_vcpu(struct kvm_device *dev,
> +						  struct kvm_vcpu *vcpu, u32 cpu) { return -EBUSY; }
> +static inline void kvmppc_xive_native_cleanup_vcpu(struct kvm_vcpu *vcpu) { }
> +static inline void kvmppc_xive_native_init_module(void) { }
> +static inline void kvmppc_xive_native_exit_module(void) { }
> +
>  #endif /* CONFIG_KVM_XIVE */
>  
>  /*
> diff --git a/arch/powerpc/kvm/book3s_xive.h b/arch/powerpc/kvm/book3s_xive.h
> index 10c4aa5cd010..5f22415520b4 100644
> --- a/arch/powerpc/kvm/book3s_xive.h
> +++ b/arch/powerpc/kvm/book3s_xive.h
> @@ -12,6 +12,9 @@
>  #ifdef CONFIG_KVM_XICS
>  #include "book3s_xics.h"
>  
> +#define KVMPPC_XIVE_FIRST_IRQ	0
> +#define KVMPPC_XIVE_NR_IRQS	KVMPPC_XICS_NR_IRQS
> +
>  /*
>   * State for one guest irq source.
>   *
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 6d4ea4b6c922..52bf74a1616e 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -988,6 +988,7 @@ struct kvm_ppc_resize_hpt {
>  #define KVM_CAP_ARM_VM_IPA_SIZE 165
>  #define KVM_CAP_MANUAL_DIRTY_LOG_PROTECT 166
>  #define KVM_CAP_HYPERV_CPUID 167
> +#define KVM_CAP_PPC_IRQ_XIVE 168
>  
>  #ifdef KVM_CAP_IRQ_ROUTING
>  
> @@ -1211,6 +1212,8 @@ enum kvm_device_type {
>  #define KVM_DEV_TYPE_ARM_VGIC_V3	KVM_DEV_TYPE_ARM_VGIC_V3
>  	KVM_DEV_TYPE_ARM_VGIC_ITS,
>  #define KVM_DEV_TYPE_ARM_VGIC_ITS	KVM_DEV_TYPE_ARM_VGIC_ITS
> +	KVM_DEV_TYPE_XIVE,
> +#define KVM_DEV_TYPE_XIVE		KVM_DEV_TYPE_XIVE
>  	KVM_DEV_TYPE_MAX,
>  };
>  
> diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
> index bd1a677dd9e4..de7eed191107 100644
> --- a/arch/powerpc/kvm/book3s.c
> +++ b/arch/powerpc/kvm/book3s.c
> @@ -1039,7 +1039,10 @@ static int kvmppc_book3s_init(void)
>  #ifdef CONFIG_KVM_XIVE
>  	if (xive_enabled()) {
>  		kvmppc_xive_init_module();
> +		kvmppc_xive_native_init_module();
>  		kvm_register_device_ops(&kvm_xive_ops, KVM_DEV_TYPE_XICS);
> +		kvm_register_device_ops(&kvm_xive_native_ops,
> +					KVM_DEV_TYPE_XIVE);
>  	} else
>  #endif
>  		kvm_register_device_ops(&kvm_xics_ops, KVM_DEV_TYPE_XICS);
> @@ -1050,8 +1053,10 @@ static int kvmppc_book3s_init(void)
>  static void kvmppc_book3s_exit(void)
>  {
>  #ifdef CONFIG_KVM_XICS
> -	if (xive_enabled())
> +	if (xive_enabled()) {
>  		kvmppc_xive_exit_module();
> +		kvmppc_xive_native_exit_module();
> +	}
>  #endif
>  #ifdef CONFIG_KVM_BOOK3S_32_HANDLER
>  	kvmppc_book3s_exit_pr();
> diff --git a/arch/powerpc/kvm/book3s_xive_native.c b/arch/powerpc/kvm/book3s_xive_native.c
> new file mode 100644
> index 000000000000..115143e76c45
> --- /dev/null
> +++ b/arch/powerpc/kvm/book3s_xive_native.c
> @@ -0,0 +1,332 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Copyright (c) 2017-2019, IBM Corporation.
> + */
> +
> +#define pr_fmt(fmt) "xive-kvm: " fmt
> +
> +#include <linux/anon_inodes.h>
> +#include <linux/kernel.h>
> +#include <linux/kvm_host.h>
> +#include <linux/err.h>
> +#include <linux/gfp.h>
> +#include <linux/spinlock.h>
> +#include <linux/delay.h>
> +#include <linux/percpu.h>
> +#include <linux/cpumask.h>
> +#include <asm/uaccess.h>
> +#include <asm/kvm_book3s.h>
> +#include <asm/kvm_ppc.h>
> +#include <asm/hvcall.h>
> +#include <asm/xics.h>
> +#include <asm/xive.h>
> +#include <asm/xive-regs.h>
> +#include <asm/debug.h>
> +#include <asm/debugfs.h>
> +#include <asm/time.h>
> +#include <asm/opal.h>
> +
> +#include <linux/debugfs.h>
> +#include <linux/seq_file.h>
> +
> +#include "book3s_xive.h"
> +
> +static void xive_native_cleanup_queue(struct kvm_vcpu *vcpu, int prio)
> +{
> +	struct kvmppc_xive_vcpu *xc = vcpu->arch.xive_vcpu;
> +	struct xive_q *q = &xc->queues[prio];
> +
> +	xive_native_disable_queue(xc->vp_id, q, prio);
> +	if (q->qpage) {
> +		put_page(virt_to_page(q->qpage));
> +		q->qpage = NULL;
> +	}
> +}
> +
> +void kvmppc_xive_native_cleanup_vcpu(struct kvm_vcpu *vcpu)
> +{
> +	struct kvmppc_xive_vcpu *xc = vcpu->arch.xive_vcpu;
> +	int i;
> +
> +	if (!kvmppc_xive_enabled(vcpu))
> +		return;
> +
> +	if (!xc)
> +		return;
> +
> +	pr_devel("native_cleanup_vcpu(cpu=%d)\n", xc->server_num);
> +
> +	/* Ensure no interrupt is still routed to that VP */
> +	xc->valid = false;
> +	kvmppc_xive_disable_vcpu_interrupts(vcpu);
> +
> +	/* Disable the VP */
> +	xive_native_disable_vp(xc->vp_id);
> +
> +	/* Free the queues & associated interrupts */
> +	for (i = 0; i < KVMPPC_XIVE_Q_COUNT; i++) {
> +		/* Free the escalation irq */
> +		if (xc->esc_virq[i]) {
> +			free_irq(xc->esc_virq[i], vcpu);
> +			irq_dispose_mapping(xc->esc_virq[i]);
> +			kfree(xc->esc_virq_names[i]);
> +			xc->esc_virq[i] = 0;
> +		}
> +
> +		/* Free the queue */
> +		xive_native_cleanup_queue(vcpu, i);
> +	}
> +
> +	/* Free the VP */
> +	kfree(xc);
> +
> +	/* Cleanup the vcpu */
> +	vcpu->arch.irq_type = KVMPPC_IRQ_DEFAULT;
> +	vcpu->arch.xive_vcpu = NULL;
> +}
> +
> +int kvmppc_xive_native_connect_vcpu(struct kvm_device *dev,
> +				    struct kvm_vcpu *vcpu, u32 cpu)

Why do we need both a *vcpu and a cpu number as an integer?

> +{
> +	struct kvmppc_xive *xive = dev->private;
> +	struct kvmppc_xive_vcpu *xc;
> +	int rc;
> +
> +	pr_devel("native_connect_vcpu(cpu=%d)\n", cpu);
> +
> +	if (dev->ops != &kvm_xive_native_ops) {
> +		pr_devel("Wrong ops !\n");
> +		return -EPERM;
> +	}
> +	if (xive->kvm != vcpu->kvm)
> +		return -EPERM;
> +	if (vcpu->arch.irq_type)

Please use an explicit == / != here so we don't have to remember which
symbolic value corresponds to 0.

> +		return -EBUSY;
> +	if (kvmppc_xive_find_server(vcpu->kvm, cpu)) {
> +		pr_devel("Duplicate !\n");
> +		return -EEXIST;
> +	}
> +	if (cpu >= KVM_MAX_VCPUS) {
> +		pr_devel("Out of bounds !\n");
> +		return -EINVAL;
> +	}
> +	xc = kzalloc(sizeof(*xc), GFP_KERNEL);
> +	if (!xc)
> +		return -ENOMEM;
> +
> +	mutex_lock(&vcpu->kvm->lock);
> +	vcpu->arch.xive_vcpu = xc;
> +	xc->xive = xive;
> +	xc->vcpu = vcpu;
> +	xc->server_num = cpu;
> +	xc->vp_id = xive->vp_base + cpu;
> +	xc->valid = true;
> +
> +	rc = xive_native_get_vp_info(xc->vp_id, &xc->vp_cam, &xc->vp_chip_id);
> +	if (rc) {
> +		pr_err("Failed to get VP info from OPAL: %d\n", rc);
> +		goto bail;
> +	}
> +
> +	/*
> +	 * Enable the VP first as the single escalation mode will
> +	 * affect escalation interrupts numbering
> +	 */
> +	rc = xive_native_enable_vp(xc->vp_id, xive->single_escalation);
> +	if (rc) {
> +		pr_err("Failed to enable VP in OPAL: %d\n", rc);
> +		goto bail;
> +	}
> +
> +	/* Configure VCPU fields for use by assembly push/pull */
> +	vcpu->arch.xive_saved_state.w01 = cpu_to_be64(0xff000000);
> +	vcpu->arch.xive_cam_word = cpu_to_be32(xc->vp_cam | TM_QW1W2_VO);
> +
> +	/* TODO: initialize queues ? */
> +
> +bail:
> +	vcpu->arch.irq_type = KVMPPC_IRQ_XIVE;
> +	mutex_unlock(&vcpu->kvm->lock);
> +	if (rc)
> +		kvmppc_xive_native_cleanup_vcpu(vcpu);
> +
> +	return rc;
> +}
> +
> +static int kvmppc_xive_native_set_attr(struct kvm_device *dev,
> +				       struct kvm_device_attr *attr)
> +{
> +	return -ENXIO;
> +}
> +
> +static int kvmppc_xive_native_get_attr(struct kvm_device *dev,
> +				       struct kvm_device_attr *attr)
> +{
> +	return -ENXIO;
> +}
> +
> +static int kvmppc_xive_native_has_attr(struct kvm_device *dev,
> +				       struct kvm_device_attr *attr)
> +{
> +	return -ENXIO;
> +}
> +
> +static void kvmppc_xive_native_free(struct kvm_device *dev)
> +{
> +	struct kvmppc_xive *xive = dev->private;
> +	struct kvm *kvm = xive->kvm;
> +	int i;
> +
> +	debugfs_remove(xive->dentry);
> +
> +	pr_devel("Destroying xive native for partition\n");
> +
> +	if (kvm)
> +		kvm->arch.xive = NULL;
> +
> +	/* Mask and free interrupts */
> +	for (i = 0; i <= xive->max_sbid; i++) {
> +		if (xive->src_blocks[i])
> +			kvmppc_xive_free_sources(xive->src_blocks[i]);
> +		kfree(xive->src_blocks[i]);
> +		xive->src_blocks[i] = NULL;
> +	}
> +
> +	if (xive->vp_base != XIVE_INVALID_VP)
> +		xive_native_free_vp_block(xive->vp_base);
> +
> +	kfree(xive);
> +	kfree(dev);
> +}
> +
> +static int kvmppc_xive_native_create(struct kvm_device *dev, u32 type)
> +{
> +	struct kvmppc_xive *xive;
> +	struct kvm *kvm = dev->kvm;
> +	int ret = 0;
> +
> +	pr_devel("Creating xive native for partition\n");
> +
> +	if (kvm->arch.xive)
> +		return -EEXIST;
> +
> +	xive = kzalloc(sizeof(*xive), GFP_KERNEL);
> +	if (!xive)
> +		return -ENOMEM;
> +
> +	dev->private = xive;
> +	xive->dev = dev;
> +	xive->kvm = kvm;
> +	kvm->arch.xive = xive;
> +
> +	/* We use the default queue size set by the host */
> +	xive->q_order = xive_native_default_eq_shift();
> +	if (xive->q_order < PAGE_SHIFT)
> +		xive->q_page_order = 0;
> +	else
> +		xive->q_page_order = xive->q_order - PAGE_SHIFT;
> +
> +	/* Allocate a bunch of VPs */
> +	xive->vp_base = xive_native_alloc_vp_block(KVM_MAX_VCPUS);
> +	pr_devel("VP_Base=%x\n", xive->vp_base);
> +
> +	if (xive->vp_base == XIVE_INVALID_VP)
> +		ret = -ENOMEM;
> +
> +	xive->single_escalation = xive_native_has_single_escalation();
> +
> +	if (ret)
> +		kfree(xive);
> +
> +	return ret;
> +}
> +
> +static int xive_native_debug_show(struct seq_file *m, void *private)
> +{
> +	struct kvmppc_xive *xive = m->private;
> +	struct kvm *kvm = xive->kvm;
> +	struct kvm_vcpu *vcpu;
> +	unsigned int i;
> +
> +	if (!kvm)
> +		return 0;
> +
> +	seq_puts(m, "=========\nVCPU state\n=========\n");
> +
> +	kvm_for_each_vcpu(i, vcpu, kvm) {
> +		struct kvmppc_xive_vcpu *xc = vcpu->arch.xive_vcpu;
> +
> +		if (!xc)
> +			continue;
> +
> +		seq_printf(m, "cpu server %#x NSR=%02x CPPR=%02x IBP=%02x PIPR=%02x w01=%016llx w2=%08x\n",
> +			   xc->server_num,
> +			   vcpu->arch.xive_saved_state.nsr,
> +			   vcpu->arch.xive_saved_state.cppr,
> +			   vcpu->arch.xive_saved_state.ipb,
> +			   vcpu->arch.xive_saved_state.pipr,
> +			   vcpu->arch.xive_saved_state.w01,
> +			   (u32) vcpu->arch.xive_cam_word);
> +
> +		kvmppc_xive_debug_show_queues(m, vcpu);
> +	}
> +
> +	return 0;
> +}
> +
> +static int xive_native_debug_open(struct inode *inode, struct file *file)
> +{
> +	return single_open(file, xive_native_debug_show, inode->i_private);
> +}
> +
> +static const struct file_operations xive_native_debug_fops = {
> +	.open = xive_native_debug_open,
> +	.read = seq_read,
> +	.llseek = seq_lseek,
> +	.release = single_release,
> +};
> +
> +static void xive_native_debugfs_init(struct kvmppc_xive *xive)
> +{
> +	char *name;
> +
> +	name = kasprintf(GFP_KERNEL, "kvm-xive-%p", xive);
> +	if (!name) {
> +		pr_err("%s: no memory for name\n", __func__);
> +		return;
> +	}
> +
> +	xive->dentry = debugfs_create_file(name, 0444, powerpc_debugfs_root,
> +					   xive, &xive_native_debug_fops);
> +
> +	pr_debug("%s: created %s\n", __func__, name);
> +	kfree(name);
> +}
> +
> +static void kvmppc_xive_native_init(struct kvm_device *dev)
> +{
> +	struct kvmppc_xive *xive = (struct kvmppc_xive *)dev->private;
> +
> +	/* Register some debug interfaces */
> +	xive_native_debugfs_init(xive);
> +}
> +
> +struct kvm_device_ops kvm_xive_native_ops = {
> +	.name = "kvm-xive-native",
> +	.create = kvmppc_xive_native_create,
> +	.init = kvmppc_xive_native_init,
> +	.destroy = kvmppc_xive_native_free,
> +	.set_attr = kvmppc_xive_native_set_attr,
> +	.get_attr = kvmppc_xive_native_get_attr,
> +	.has_attr = kvmppc_xive_native_has_attr,
> +};
> +
> +void kvmppc_xive_native_init_module(void)
> +{
> +	;
> +}
> +
> +void kvmppc_xive_native_exit_module(void)
> +{
> +	;
> +}
> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
> index b90a7d154180..01d526e15e9d 100644
> --- a/arch/powerpc/kvm/powerpc.c
> +++ b/arch/powerpc/kvm/powerpc.c
> @@ -566,6 +566,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>  	case KVM_CAP_PPC_ENABLE_HCALL:
>  #ifdef CONFIG_KVM_XICS
>  	case KVM_CAP_IRQ_XICS:
> +#endif
> +#ifdef CONFIG_KVM_XIVE
> +	case KVM_CAP_PPC_IRQ_XIVE:
>  #endif
>  	case KVM_CAP_PPC_GET_CPU_CHAR:
>  		r = 1;
> @@ -753,6 +756,9 @@ void kvm_arch_vcpu_free(struct kvm_vcpu *vcpu)
>  		else
>  			kvmppc_xics_free_icp(vcpu);
>  		break;
> +	case KVMPPC_IRQ_XIVE:
> +		kvmppc_xive_native_cleanup_vcpu(vcpu);
> +		break;
>  	}
>  
>  	kvmppc_core_vcpu_free(vcpu);
> @@ -1941,6 +1947,30 @@ static int kvm_vcpu_ioctl_enable_cap(struct kvm_vcpu *vcpu,
>  		break;
>  	}
>  #endif /* CONFIG_KVM_XICS */
> +#ifdef CONFIG_KVM_XIVE
> +	case KVM_CAP_PPC_IRQ_XIVE: {
> +		struct fd f;
> +		struct kvm_device *dev;
> +
> +		r = -EBADF;
> +		f = fdget(cap->args[0]);
> +		if (!f.file)
> +			break;
> +
> +		r = -ENXIO;
> +		if (!xive_enabled())
> +			break;
> +
> +		r = -EPERM;
> +		dev = kvm_device_from_filp(f.file);
> +		if (dev)
> +			r = kvmppc_xive_native_connect_vcpu(dev, vcpu,
> +							    cap->args[1]);
> +
> +		fdput(f);
> +		break;
> +	}
> +#endif /* CONFIG_KVM_XIVE */
>  #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
>  	case KVM_CAP_PPC_FWNMI:
>  		r = -EINVAL;
> diff --git a/arch/powerpc/kvm/Makefile b/arch/powerpc/kvm/Makefile
> index 64f1135e7732..806cbe488410 100644
> --- a/arch/powerpc/kvm/Makefile
> +++ b/arch/powerpc/kvm/Makefile
> @@ -99,7 +99,7 @@ endif
>  kvm-book3s_64-objs-$(CONFIG_KVM_XICS) += \
>  	book3s_xics.o
>  
> -kvm-book3s_64-objs-$(CONFIG_KVM_XIVE) += book3s_xive.o
> +kvm-book3s_64-objs-$(CONFIG_KVM_XIVE) += book3s_xive.o book3s_xive_native.o
>  kvm-book3s_64-objs-$(CONFIG_SPAPR_TCE_IOMMU) += book3s_64_vio.o
>  
>  kvm-book3s_64-module-objs := \
Cédric Le Goater Feb. 4, 2019, 11:19 a.m. UTC | #9
On 2/4/19 5:25 AM, David Gibson wrote:
> On Mon, Jan 07, 2019 at 07:43:17PM +0100, Cédric Le Goater wrote:
>> This is the basic framework for the new KVM device supporting the XIVE
>> native exploitation mode. The user interface exposes a new capability
>> and a new KVM device to be used by QEMU.
>>
>> Internally, the interface to the new KVM device is protected with a
>> new interrupt mode: KVMPPC_IRQ_XIVE.
>>
>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
>> ---
>>  arch/powerpc/include/asm/kvm_host.h   |   2 +
>>  arch/powerpc/include/asm/kvm_ppc.h    |  21 ++
>>  arch/powerpc/kvm/book3s_xive.h        |   3 +
>>  include/uapi/linux/kvm.h              |   3 +
>>  arch/powerpc/kvm/book3s.c             |   7 +-
>>  arch/powerpc/kvm/book3s_xive_native.c | 332 ++++++++++++++++++++++++++
>>  arch/powerpc/kvm/powerpc.c            |  30 +++
>>  arch/powerpc/kvm/Makefile             |   2 +-
>>  8 files changed, 398 insertions(+), 2 deletions(-)
>>  create mode 100644 arch/powerpc/kvm/book3s_xive_native.c
>>
>> diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
>> index 0f98f00da2ea..c522e8274ad9 100644
>> --- a/arch/powerpc/include/asm/kvm_host.h
>> +++ b/arch/powerpc/include/asm/kvm_host.h
>> @@ -220,6 +220,7 @@ extern struct kvm_device_ops kvm_xics_ops;
>>  struct kvmppc_xive;
>>  struct kvmppc_xive_vcpu;
>>  extern struct kvm_device_ops kvm_xive_ops;
>> +extern struct kvm_device_ops kvm_xive_native_ops;
>>  
>>  struct kvmppc_passthru_irqmap;
>>  
>> @@ -446,6 +447,7 @@ struct kvmppc_passthru_irqmap {
>>  #define KVMPPC_IRQ_DEFAULT	0
>>  #define KVMPPC_IRQ_MPIC		1
>>  #define KVMPPC_IRQ_XICS		2 /* Includes a XIVE option */
>> +#define KVMPPC_IRQ_XIVE		3 /* XIVE native exploitation mode */
>>  
>>  #define MMIO_HPTE_CACHE_SIZE	4
>>  
>> diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
>> index eb0d79f0ca45..1bb313f238fe 100644
>> --- a/arch/powerpc/include/asm/kvm_ppc.h
>> +++ b/arch/powerpc/include/asm/kvm_ppc.h
>> @@ -591,6 +591,18 @@ extern int kvmppc_xive_set_icp(struct kvm_vcpu *vcpu, u64 icpval);
>>  extern int kvmppc_xive_set_irq(struct kvm *kvm, int irq_source_id, u32 irq,
>>  			       int level, bool line_status);
>>  extern void kvmppc_xive_push_vcpu(struct kvm_vcpu *vcpu);
>> +
>> +static inline int kvmppc_xive_enabled(struct kvm_vcpu *vcpu)
>> +{
>> +	return vcpu->arch.irq_type == KVMPPC_IRQ_XIVE;
>> +}
>> +
>> +extern int kvmppc_xive_native_connect_vcpu(struct kvm_device *dev,
>> +				    struct kvm_vcpu *vcpu, u32 cpu);
>> +extern void kvmppc_xive_native_cleanup_vcpu(struct kvm_vcpu *vcpu);
>> +extern void kvmppc_xive_native_init_module(void);
>> +extern void kvmppc_xive_native_exit_module(void);
>> +
>>  #else
>>  static inline int kvmppc_xive_set_xive(struct kvm *kvm, u32 irq, u32 server,
>>  				       u32 priority) { return -1; }
>> @@ -614,6 +626,15 @@ static inline int kvmppc_xive_set_icp(struct kvm_vcpu *vcpu, u64 icpval) { retur
>>  static inline int kvmppc_xive_set_irq(struct kvm *kvm, int irq_source_id, u32 irq,
>>  				      int level, bool line_status) { return -ENODEV; }
>>  static inline void kvmppc_xive_push_vcpu(struct kvm_vcpu *vcpu) { }
>> +
>> +static inline int kvmppc_xive_enabled(struct kvm_vcpu *vcpu)
>> +	{ return 0; }
>> +static inline int kvmppc_xive_native_connect_vcpu(struct kvm_device *dev,
>> +						  struct kvm_vcpu *vcpu, u32 cpu) { return -EBUSY; }
>> +static inline void kvmppc_xive_native_cleanup_vcpu(struct kvm_vcpu *vcpu) { }
>> +static inline void kvmppc_xive_native_init_module(void) { }
>> +static inline void kvmppc_xive_native_exit_module(void) { }
>> +
>>  #endif /* CONFIG_KVM_XIVE */
>>  
>>  /*
>> diff --git a/arch/powerpc/kvm/book3s_xive.h b/arch/powerpc/kvm/book3s_xive.h
>> index 10c4aa5cd010..5f22415520b4 100644
>> --- a/arch/powerpc/kvm/book3s_xive.h
>> +++ b/arch/powerpc/kvm/book3s_xive.h
>> @@ -12,6 +12,9 @@
>>  #ifdef CONFIG_KVM_XICS
>>  #include "book3s_xics.h"
>>  
>> +#define KVMPPC_XIVE_FIRST_IRQ	0
>> +#define KVMPPC_XIVE_NR_IRQS	KVMPPC_XICS_NR_IRQS
>> +
>>  /*
>>   * State for one guest irq source.
>>   *
>> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
>> index 6d4ea4b6c922..52bf74a1616e 100644
>> --- a/include/uapi/linux/kvm.h
>> +++ b/include/uapi/linux/kvm.h
>> @@ -988,6 +988,7 @@ struct kvm_ppc_resize_hpt {
>>  #define KVM_CAP_ARM_VM_IPA_SIZE 165
>>  #define KVM_CAP_MANUAL_DIRTY_LOG_PROTECT 166
>>  #define KVM_CAP_HYPERV_CPUID 167
>> +#define KVM_CAP_PPC_IRQ_XIVE 168
>>  
>>  #ifdef KVM_CAP_IRQ_ROUTING
>>  
>> @@ -1211,6 +1212,8 @@ enum kvm_device_type {
>>  #define KVM_DEV_TYPE_ARM_VGIC_V3	KVM_DEV_TYPE_ARM_VGIC_V3
>>  	KVM_DEV_TYPE_ARM_VGIC_ITS,
>>  #define KVM_DEV_TYPE_ARM_VGIC_ITS	KVM_DEV_TYPE_ARM_VGIC_ITS
>> +	KVM_DEV_TYPE_XIVE,
>> +#define KVM_DEV_TYPE_XIVE		KVM_DEV_TYPE_XIVE
>>  	KVM_DEV_TYPE_MAX,
>>  };
>>  
>> diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
>> index bd1a677dd9e4..de7eed191107 100644
>> --- a/arch/powerpc/kvm/book3s.c
>> +++ b/arch/powerpc/kvm/book3s.c
>> @@ -1039,7 +1039,10 @@ static int kvmppc_book3s_init(void)
>>  #ifdef CONFIG_KVM_XIVE
>>  	if (xive_enabled()) {
>>  		kvmppc_xive_init_module();
>> +		kvmppc_xive_native_init_module();
>>  		kvm_register_device_ops(&kvm_xive_ops, KVM_DEV_TYPE_XICS);
>> +		kvm_register_device_ops(&kvm_xive_native_ops,
>> +					KVM_DEV_TYPE_XIVE);
>>  	} else
>>  #endif
>>  		kvm_register_device_ops(&kvm_xics_ops, KVM_DEV_TYPE_XICS);
>> @@ -1050,8 +1053,10 @@ static int kvmppc_book3s_init(void)
>>  static void kvmppc_book3s_exit(void)
>>  {
>>  #ifdef CONFIG_KVM_XICS
>> -	if (xive_enabled())
>> +	if (xive_enabled()) {
>>  		kvmppc_xive_exit_module();
>> +		kvmppc_xive_native_exit_module();
>> +	}
>>  #endif
>>  #ifdef CONFIG_KVM_BOOK3S_32_HANDLER
>>  	kvmppc_book3s_exit_pr();
>> diff --git a/arch/powerpc/kvm/book3s_xive_native.c b/arch/powerpc/kvm/book3s_xive_native.c
>> new file mode 100644
>> index 000000000000..115143e76c45
>> --- /dev/null
>> +++ b/arch/powerpc/kvm/book3s_xive_native.c
>> @@ -0,0 +1,332 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/*
>> + * Copyright (c) 2017-2019, IBM Corporation.
>> + */
>> +
>> +#define pr_fmt(fmt) "xive-kvm: " fmt
>> +
>> +#include <linux/anon_inodes.h>
>> +#include <linux/kernel.h>
>> +#include <linux/kvm_host.h>
>> +#include <linux/err.h>
>> +#include <linux/gfp.h>
>> +#include <linux/spinlock.h>
>> +#include <linux/delay.h>
>> +#include <linux/percpu.h>
>> +#include <linux/cpumask.h>
>> +#include <asm/uaccess.h>
>> +#include <asm/kvm_book3s.h>
>> +#include <asm/kvm_ppc.h>
>> +#include <asm/hvcall.h>
>> +#include <asm/xics.h>
>> +#include <asm/xive.h>
>> +#include <asm/xive-regs.h>
>> +#include <asm/debug.h>
>> +#include <asm/debugfs.h>
>> +#include <asm/time.h>
>> +#include <asm/opal.h>
>> +
>> +#include <linux/debugfs.h>
>> +#include <linux/seq_file.h>
>> +
>> +#include "book3s_xive.h"
>> +
>> +static void xive_native_cleanup_queue(struct kvm_vcpu *vcpu, int prio)
>> +{
>> +	struct kvmppc_xive_vcpu *xc = vcpu->arch.xive_vcpu;
>> +	struct xive_q *q = &xc->queues[prio];
>> +
>> +	xive_native_disable_queue(xc->vp_id, q, prio);
>> +	if (q->qpage) {
>> +		put_page(virt_to_page(q->qpage));
>> +		q->qpage = NULL;
>> +	}
>> +}
>> +
>> +void kvmppc_xive_native_cleanup_vcpu(struct kvm_vcpu *vcpu)
>> +{
>> +	struct kvmppc_xive_vcpu *xc = vcpu->arch.xive_vcpu;
>> +	int i;
>> +
>> +	if (!kvmppc_xive_enabled(vcpu))
>> +		return;
>> +
>> +	if (!xc)
>> +		return;
>> +
>> +	pr_devel("native_cleanup_vcpu(cpu=%d)\n", xc->server_num);
>> +
>> +	/* Ensure no interrupt is still routed to that VP */
>> +	xc->valid = false;
>> +	kvmppc_xive_disable_vcpu_interrupts(vcpu);
>> +
>> +	/* Disable the VP */
>> +	xive_native_disable_vp(xc->vp_id);
>> +
>> +	/* Free the queues & associated interrupts */
>> +	for (i = 0; i < KVMPPC_XIVE_Q_COUNT; i++) {
>> +		/* Free the escalation irq */
>> +		if (xc->esc_virq[i]) {
>> +			free_irq(xc->esc_virq[i], vcpu);
>> +			irq_dispose_mapping(xc->esc_virq[i]);
>> +			kfree(xc->esc_virq_names[i]);
>> +			xc->esc_virq[i] = 0;
>> +		}
>> +
>> +		/* Free the queue */
>> +		xive_native_cleanup_queue(vcpu, i);
>> +	}
>> +
>> +	/* Free the VP */
>> +	kfree(xc);
>> +
>> +	/* Cleanup the vcpu */
>> +	vcpu->arch.irq_type = KVMPPC_IRQ_DEFAULT;
>> +	vcpu->arch.xive_vcpu = NULL;
>> +}
>> +
>> +int kvmppc_xive_native_connect_vcpu(struct kvm_device *dev,
>> +				    struct kvm_vcpu *vcpu, u32 cpu)
> 
> Why do we need both a *vcpu and a cpu number as an integer?

To be in sync with the other similar routines : kvmppc_xics_connect_vcpu() 
and kvmppc_xive_connect_vcpu().

But if we consider that this 'cpu' parameter is always in sync with 
vcpu->vcpu_id, we could remove it from the KVM ioctl call I suppose.

Should we do the same for the other routines ? 
 
>> +{
>> +	struct kvmppc_xive *xive = dev->private;
>> +	struct kvmppc_xive_vcpu *xc;
>> +	int rc;
>> +
>> +	pr_devel("native_connect_vcpu(cpu=%d)\n", cpu);
>> +
>> +	if (dev->ops != &kvm_xive_native_ops) {
>> +		pr_devel("Wrong ops !\n");
>> +		return -EPERM;
>> +	}
>> +	if (xive->kvm != vcpu->kvm)
>> +		return -EPERM;
>> +	if (vcpu->arch.irq_type)
> 
> Please use an explicit == / != here so we don't have to remember which
> symbolic value corresponds to 0.

ok. I agree.

Thanks,

C. 


> 
>> +		return -EBUSY;
>> +	if (kvmppc_xive_find_server(vcpu->kvm, cpu)) {
>> +		pr_devel("Duplicate !\n");
>> +		return -EEXIST;
>> +	}
>> +	if (cpu >= KVM_MAX_VCPUS) {
>> +		pr_devel("Out of bounds !\n");
>> +		return -EINVAL;
>> +	}
>> +	xc = kzalloc(sizeof(*xc), GFP_KERNEL);
>> +	if (!xc)
>> +		return -ENOMEM;
>> +
>> +	mutex_lock(&vcpu->kvm->lock);
>> +	vcpu->arch.xive_vcpu = xc;
>> +	xc->xive = xive;
>> +	xc->vcpu = vcpu;
>> +	xc->server_num = cpu;
>> +	xc->vp_id = xive->vp_base + cpu;
>> +	xc->valid = true;
>> +
>> +	rc = xive_native_get_vp_info(xc->vp_id, &xc->vp_cam, &xc->vp_chip_id);
>> +	if (rc) {
>> +		pr_err("Failed to get VP info from OPAL: %d\n", rc);
>> +		goto bail;
>> +	}
>> +
>> +	/*
>> +	 * Enable the VP first as the single escalation mode will
>> +	 * affect escalation interrupts numbering
>> +	 */
>> +	rc = xive_native_enable_vp(xc->vp_id, xive->single_escalation);
>> +	if (rc) {
>> +		pr_err("Failed to enable VP in OPAL: %d\n", rc);
>> +		goto bail;
>> +	}
>> +
>> +	/* Configure VCPU fields for use by assembly push/pull */
>> +	vcpu->arch.xive_saved_state.w01 = cpu_to_be64(0xff000000);
>> +	vcpu->arch.xive_cam_word = cpu_to_be32(xc->vp_cam | TM_QW1W2_VO);
>> +
>> +	/* TODO: initialize queues ? */
>> +
>> +bail:
>> +	vcpu->arch.irq_type = KVMPPC_IRQ_XIVE;
>> +	mutex_unlock(&vcpu->kvm->lock);
>> +	if (rc)
>> +		kvmppc_xive_native_cleanup_vcpu(vcpu);
>> +
>> +	return rc;
>> +}
>> +
>> +static int kvmppc_xive_native_set_attr(struct kvm_device *dev,
>> +				       struct kvm_device_attr *attr)
>> +{
>> +	return -ENXIO;
>> +}
>> +
>> +static int kvmppc_xive_native_get_attr(struct kvm_device *dev,
>> +				       struct kvm_device_attr *attr)
>> +{
>> +	return -ENXIO;
>> +}
>> +
>> +static int kvmppc_xive_native_has_attr(struct kvm_device *dev,
>> +				       struct kvm_device_attr *attr)
>> +{
>> +	return -ENXIO;
>> +}
>> +
>> +static void kvmppc_xive_native_free(struct kvm_device *dev)
>> +{
>> +	struct kvmppc_xive *xive = dev->private;
>> +	struct kvm *kvm = xive->kvm;
>> +	int i;
>> +
>> +	debugfs_remove(xive->dentry);
>> +
>> +	pr_devel("Destroying xive native for partition\n");
>> +
>> +	if (kvm)
>> +		kvm->arch.xive = NULL;
>> +
>> +	/* Mask and free interrupts */
>> +	for (i = 0; i <= xive->max_sbid; i++) {
>> +		if (xive->src_blocks[i])
>> +			kvmppc_xive_free_sources(xive->src_blocks[i]);
>> +		kfree(xive->src_blocks[i]);
>> +		xive->src_blocks[i] = NULL;
>> +	}
>> +
>> +	if (xive->vp_base != XIVE_INVALID_VP)
>> +		xive_native_free_vp_block(xive->vp_base);
>> +
>> +	kfree(xive);
>> +	kfree(dev);
>> +}
>> +
>> +static int kvmppc_xive_native_create(struct kvm_device *dev, u32 type)
>> +{
>> +	struct kvmppc_xive *xive;
>> +	struct kvm *kvm = dev->kvm;
>> +	int ret = 0;
>> +
>> +	pr_devel("Creating xive native for partition\n");
>> +
>> +	if (kvm->arch.xive)
>> +		return -EEXIST;
>> +
>> +	xive = kzalloc(sizeof(*xive), GFP_KERNEL);
>> +	if (!xive)
>> +		return -ENOMEM;
>> +
>> +	dev->private = xive;
>> +	xive->dev = dev;
>> +	xive->kvm = kvm;
>> +	kvm->arch.xive = xive;
>> +
>> +	/* We use the default queue size set by the host */
>> +	xive->q_order = xive_native_default_eq_shift();
>> +	if (xive->q_order < PAGE_SHIFT)
>> +		xive->q_page_order = 0;
>> +	else
>> +		xive->q_page_order = xive->q_order - PAGE_SHIFT;
>> +
>> +	/* Allocate a bunch of VPs */
>> +	xive->vp_base = xive_native_alloc_vp_block(KVM_MAX_VCPUS);
>> +	pr_devel("VP_Base=%x\n", xive->vp_base);
>> +
>> +	if (xive->vp_base == XIVE_INVALID_VP)
>> +		ret = -ENOMEM;
>> +
>> +	xive->single_escalation = xive_native_has_single_escalation();
>> +
>> +	if (ret)
>> +		kfree(xive);
>> +
>> +	return ret;
>> +}
>> +
>> +static int xive_native_debug_show(struct seq_file *m, void *private)
>> +{
>> +	struct kvmppc_xive *xive = m->private;
>> +	struct kvm *kvm = xive->kvm;
>> +	struct kvm_vcpu *vcpu;
>> +	unsigned int i;
>> +
>> +	if (!kvm)
>> +		return 0;
>> +
>> +	seq_puts(m, "=========\nVCPU state\n=========\n");
>> +
>> +	kvm_for_each_vcpu(i, vcpu, kvm) {
>> +		struct kvmppc_xive_vcpu *xc = vcpu->arch.xive_vcpu;
>> +
>> +		if (!xc)
>> +			continue;
>> +
>> +		seq_printf(m, "cpu server %#x NSR=%02x CPPR=%02x IBP=%02x PIPR=%02x w01=%016llx w2=%08x\n",
>> +			   xc->server_num,
>> +			   vcpu->arch.xive_saved_state.nsr,
>> +			   vcpu->arch.xive_saved_state.cppr,
>> +			   vcpu->arch.xive_saved_state.ipb,
>> +			   vcpu->arch.xive_saved_state.pipr,
>> +			   vcpu->arch.xive_saved_state.w01,
>> +			   (u32) vcpu->arch.xive_cam_word);
>> +
>> +		kvmppc_xive_debug_show_queues(m, vcpu);
>> +	}
>> +
>> +	return 0;
>> +}
>> +
>> +static int xive_native_debug_open(struct inode *inode, struct file *file)
>> +{
>> +	return single_open(file, xive_native_debug_show, inode->i_private);
>> +}
>> +
>> +static const struct file_operations xive_native_debug_fops = {
>> +	.open = xive_native_debug_open,
>> +	.read = seq_read,
>> +	.llseek = seq_lseek,
>> +	.release = single_release,
>> +};
>> +
>> +static void xive_native_debugfs_init(struct kvmppc_xive *xive)
>> +{
>> +	char *name;
>> +
>> +	name = kasprintf(GFP_KERNEL, "kvm-xive-%p", xive);
>> +	if (!name) {
>> +		pr_err("%s: no memory for name\n", __func__);
>> +		return;
>> +	}
>> +
>> +	xive->dentry = debugfs_create_file(name, 0444, powerpc_debugfs_root,
>> +					   xive, &xive_native_debug_fops);
>> +
>> +	pr_debug("%s: created %s\n", __func__, name);
>> +	kfree(name);
>> +}
>> +
>> +static void kvmppc_xive_native_init(struct kvm_device *dev)
>> +{
>> +	struct kvmppc_xive *xive = (struct kvmppc_xive *)dev->private;
>> +
>> +	/* Register some debug interfaces */
>> +	xive_native_debugfs_init(xive);
>> +}
>> +
>> +struct kvm_device_ops kvm_xive_native_ops = {
>> +	.name = "kvm-xive-native",
>> +	.create = kvmppc_xive_native_create,
>> +	.init = kvmppc_xive_native_init,
>> +	.destroy = kvmppc_xive_native_free,
>> +	.set_attr = kvmppc_xive_native_set_attr,
>> +	.get_attr = kvmppc_xive_native_get_attr,
>> +	.has_attr = kvmppc_xive_native_has_attr,
>> +};
>> +
>> +void kvmppc_xive_native_init_module(void)
>> +{
>> +	;
>> +}
>> +
>> +void kvmppc_xive_native_exit_module(void)
>> +{
>> +	;
>> +}
>> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
>> index b90a7d154180..01d526e15e9d 100644
>> --- a/arch/powerpc/kvm/powerpc.c
>> +++ b/arch/powerpc/kvm/powerpc.c
>> @@ -566,6 +566,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>>  	case KVM_CAP_PPC_ENABLE_HCALL:
>>  #ifdef CONFIG_KVM_XICS
>>  	case KVM_CAP_IRQ_XICS:
>> +#endif
>> +#ifdef CONFIG_KVM_XIVE
>> +	case KVM_CAP_PPC_IRQ_XIVE:
>>  #endif
>>  	case KVM_CAP_PPC_GET_CPU_CHAR:
>>  		r = 1;
>> @@ -753,6 +756,9 @@ void kvm_arch_vcpu_free(struct kvm_vcpu *vcpu)
>>  		else
>>  			kvmppc_xics_free_icp(vcpu);
>>  		break;
>> +	case KVMPPC_IRQ_XIVE:
>> +		kvmppc_xive_native_cleanup_vcpu(vcpu);
>> +		break;
>>  	}
>>  
>>  	kvmppc_core_vcpu_free(vcpu);
>> @@ -1941,6 +1947,30 @@ static int kvm_vcpu_ioctl_enable_cap(struct kvm_vcpu *vcpu,
>>  		break;
>>  	}
>>  #endif /* CONFIG_KVM_XICS */
>> +#ifdef CONFIG_KVM_XIVE
>> +	case KVM_CAP_PPC_IRQ_XIVE: {
>> +		struct fd f;
>> +		struct kvm_device *dev;
>> +
>> +		r = -EBADF;
>> +		f = fdget(cap->args[0]);
>> +		if (!f.file)
>> +			break;
>> +
>> +		r = -ENXIO;
>> +		if (!xive_enabled())
>> +			break;
>> +
>> +		r = -EPERM;
>> +		dev = kvm_device_from_filp(f.file);
>> +		if (dev)
>> +			r = kvmppc_xive_native_connect_vcpu(dev, vcpu,
>> +							    cap->args[1]);
>> +
>> +		fdput(f);
>> +		break;
>> +	}
>> +#endif /* CONFIG_KVM_XIVE */
>>  #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
>>  	case KVM_CAP_PPC_FWNMI:
>>  		r = -EINVAL;
>> diff --git a/arch/powerpc/kvm/Makefile b/arch/powerpc/kvm/Makefile
>> index 64f1135e7732..806cbe488410 100644
>> --- a/arch/powerpc/kvm/Makefile
>> +++ b/arch/powerpc/kvm/Makefile
>> @@ -99,7 +99,7 @@ endif
>>  kvm-book3s_64-objs-$(CONFIG_KVM_XICS) += \
>>  	book3s_xics.o
>>  
>> -kvm-book3s_64-objs-$(CONFIG_KVM_XIVE) += book3s_xive.o
>> +kvm-book3s_64-objs-$(CONFIG_KVM_XIVE) += book3s_xive.o book3s_xive_native.o
>>  kvm-book3s_64-objs-$(CONFIG_SPAPR_TCE_IOMMU) += book3s_64_vio.o
>>  
>>  kvm-book3s_64-module-objs := \
>
David Gibson Feb. 5, 2019, 5:26 a.m. UTC | #10
On Mon, Feb 04, 2019 at 12:19:07PM +0100, Cédric Le Goater wrote:
> On 2/4/19 5:25 AM, David Gibson wrote:
> > On Mon, Jan 07, 2019 at 07:43:17PM +0100, Cédric Le Goater wrote:
> >> This is the basic framework for the new KVM device supporting the XIVE
> >> native exploitation mode. The user interface exposes a new capability
> >> and a new KVM device to be used by QEMU.
> >>
> >> Internally, the interface to the new KVM device is protected with a
> >> new interrupt mode: KVMPPC_IRQ_XIVE.
> >>
> >> Signed-off-by: Cédric Le Goater <clg@kaod.org>
> >> ---
> >>  arch/powerpc/include/asm/kvm_host.h   |   2 +
> >>  arch/powerpc/include/asm/kvm_ppc.h    |  21 ++
> >>  arch/powerpc/kvm/book3s_xive.h        |   3 +
> >>  include/uapi/linux/kvm.h              |   3 +
> >>  arch/powerpc/kvm/book3s.c             |   7 +-
> >>  arch/powerpc/kvm/book3s_xive_native.c | 332 ++++++++++++++++++++++++++
> >>  arch/powerpc/kvm/powerpc.c            |  30 +++
> >>  arch/powerpc/kvm/Makefile             |   2 +-
> >>  8 files changed, 398 insertions(+), 2 deletions(-)
> >>  create mode 100644 arch/powerpc/kvm/book3s_xive_native.c
> >>
> >> diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
> >> index 0f98f00da2ea..c522e8274ad9 100644
> >> --- a/arch/powerpc/include/asm/kvm_host.h
> >> +++ b/arch/powerpc/include/asm/kvm_host.h
> >> @@ -220,6 +220,7 @@ extern struct kvm_device_ops kvm_xics_ops;
> >>  struct kvmppc_xive;
> >>  struct kvmppc_xive_vcpu;
> >>  extern struct kvm_device_ops kvm_xive_ops;
> >> +extern struct kvm_device_ops kvm_xive_native_ops;
> >>  
> >>  struct kvmppc_passthru_irqmap;
> >>  
> >> @@ -446,6 +447,7 @@ struct kvmppc_passthru_irqmap {
> >>  #define KVMPPC_IRQ_DEFAULT	0
> >>  #define KVMPPC_IRQ_MPIC		1
> >>  #define KVMPPC_IRQ_XICS		2 /* Includes a XIVE option */
> >> +#define KVMPPC_IRQ_XIVE		3 /* XIVE native exploitation mode */
> >>  
> >>  #define MMIO_HPTE_CACHE_SIZE	4
> >>  
> >> diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
> >> index eb0d79f0ca45..1bb313f238fe 100644
> >> --- a/arch/powerpc/include/asm/kvm_ppc.h
> >> +++ b/arch/powerpc/include/asm/kvm_ppc.h
> >> @@ -591,6 +591,18 @@ extern int kvmppc_xive_set_icp(struct kvm_vcpu *vcpu, u64 icpval);
> >>  extern int kvmppc_xive_set_irq(struct kvm *kvm, int irq_source_id, u32 irq,
> >>  			       int level, bool line_status);
> >>  extern void kvmppc_xive_push_vcpu(struct kvm_vcpu *vcpu);
> >> +
> >> +static inline int kvmppc_xive_enabled(struct kvm_vcpu *vcpu)
> >> +{
> >> +	return vcpu->arch.irq_type == KVMPPC_IRQ_XIVE;
> >> +}
> >> +
> >> +extern int kvmppc_xive_native_connect_vcpu(struct kvm_device *dev,
> >> +				    struct kvm_vcpu *vcpu, u32 cpu);
> >> +extern void kvmppc_xive_native_cleanup_vcpu(struct kvm_vcpu *vcpu);
> >> +extern void kvmppc_xive_native_init_module(void);
> >> +extern void kvmppc_xive_native_exit_module(void);
> >> +
> >>  #else
> >>  static inline int kvmppc_xive_set_xive(struct kvm *kvm, u32 irq, u32 server,
> >>  				       u32 priority) { return -1; }
> >> @@ -614,6 +626,15 @@ static inline int kvmppc_xive_set_icp(struct kvm_vcpu *vcpu, u64 icpval) { retur
> >>  static inline int kvmppc_xive_set_irq(struct kvm *kvm, int irq_source_id, u32 irq,
> >>  				      int level, bool line_status) { return -ENODEV; }
> >>  static inline void kvmppc_xive_push_vcpu(struct kvm_vcpu *vcpu) { }
> >> +
> >> +static inline int kvmppc_xive_enabled(struct kvm_vcpu *vcpu)
> >> +	{ return 0; }
> >> +static inline int kvmppc_xive_native_connect_vcpu(struct kvm_device *dev,
> >> +						  struct kvm_vcpu *vcpu, u32 cpu) { return -EBUSY; }
> >> +static inline void kvmppc_xive_native_cleanup_vcpu(struct kvm_vcpu *vcpu) { }
> >> +static inline void kvmppc_xive_native_init_module(void) { }
> >> +static inline void kvmppc_xive_native_exit_module(void) { }
> >> +
> >>  #endif /* CONFIG_KVM_XIVE */
> >>  
> >>  /*
> >> diff --git a/arch/powerpc/kvm/book3s_xive.h b/arch/powerpc/kvm/book3s_xive.h
> >> index 10c4aa5cd010..5f22415520b4 100644
> >> --- a/arch/powerpc/kvm/book3s_xive.h
> >> +++ b/arch/powerpc/kvm/book3s_xive.h
> >> @@ -12,6 +12,9 @@
> >>  #ifdef CONFIG_KVM_XICS
> >>  #include "book3s_xics.h"
> >>  
> >> +#define KVMPPC_XIVE_FIRST_IRQ	0
> >> +#define KVMPPC_XIVE_NR_IRQS	KVMPPC_XICS_NR_IRQS
> >> +
> >>  /*
> >>   * State for one guest irq source.
> >>   *
> >> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> >> index 6d4ea4b6c922..52bf74a1616e 100644
> >> --- a/include/uapi/linux/kvm.h
> >> +++ b/include/uapi/linux/kvm.h
> >> @@ -988,6 +988,7 @@ struct kvm_ppc_resize_hpt {
> >>  #define KVM_CAP_ARM_VM_IPA_SIZE 165
> >>  #define KVM_CAP_MANUAL_DIRTY_LOG_PROTECT 166
> >>  #define KVM_CAP_HYPERV_CPUID 167
> >> +#define KVM_CAP_PPC_IRQ_XIVE 168
> >>  
> >>  #ifdef KVM_CAP_IRQ_ROUTING
> >>  
> >> @@ -1211,6 +1212,8 @@ enum kvm_device_type {
> >>  #define KVM_DEV_TYPE_ARM_VGIC_V3	KVM_DEV_TYPE_ARM_VGIC_V3
> >>  	KVM_DEV_TYPE_ARM_VGIC_ITS,
> >>  #define KVM_DEV_TYPE_ARM_VGIC_ITS	KVM_DEV_TYPE_ARM_VGIC_ITS
> >> +	KVM_DEV_TYPE_XIVE,
> >> +#define KVM_DEV_TYPE_XIVE		KVM_DEV_TYPE_XIVE
> >>  	KVM_DEV_TYPE_MAX,
> >>  };
> >>  
> >> diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
> >> index bd1a677dd9e4..de7eed191107 100644
> >> --- a/arch/powerpc/kvm/book3s.c
> >> +++ b/arch/powerpc/kvm/book3s.c
> >> @@ -1039,7 +1039,10 @@ static int kvmppc_book3s_init(void)
> >>  #ifdef CONFIG_KVM_XIVE
> >>  	if (xive_enabled()) {
> >>  		kvmppc_xive_init_module();
> >> +		kvmppc_xive_native_init_module();
> >>  		kvm_register_device_ops(&kvm_xive_ops, KVM_DEV_TYPE_XICS);
> >> +		kvm_register_device_ops(&kvm_xive_native_ops,
> >> +					KVM_DEV_TYPE_XIVE);
> >>  	} else
> >>  #endif
> >>  		kvm_register_device_ops(&kvm_xics_ops, KVM_DEV_TYPE_XICS);
> >> @@ -1050,8 +1053,10 @@ static int kvmppc_book3s_init(void)
> >>  static void kvmppc_book3s_exit(void)
> >>  {
> >>  #ifdef CONFIG_KVM_XICS
> >> -	if (xive_enabled())
> >> +	if (xive_enabled()) {
> >>  		kvmppc_xive_exit_module();
> >> +		kvmppc_xive_native_exit_module();
> >> +	}
> >>  #endif
> >>  #ifdef CONFIG_KVM_BOOK3S_32_HANDLER
> >>  	kvmppc_book3s_exit_pr();
> >> diff --git a/arch/powerpc/kvm/book3s_xive_native.c b/arch/powerpc/kvm/book3s_xive_native.c
> >> new file mode 100644
> >> index 000000000000..115143e76c45
> >> --- /dev/null
> >> +++ b/arch/powerpc/kvm/book3s_xive_native.c
> >> @@ -0,0 +1,332 @@
> >> +// SPDX-License-Identifier: GPL-2.0
> >> +/*
> >> + * Copyright (c) 2017-2019, IBM Corporation.
> >> + */
> >> +
> >> +#define pr_fmt(fmt) "xive-kvm: " fmt
> >> +
> >> +#include <linux/anon_inodes.h>
> >> +#include <linux/kernel.h>
> >> +#include <linux/kvm_host.h>
> >> +#include <linux/err.h>
> >> +#include <linux/gfp.h>
> >> +#include <linux/spinlock.h>
> >> +#include <linux/delay.h>
> >> +#include <linux/percpu.h>
> >> +#include <linux/cpumask.h>
> >> +#include <asm/uaccess.h>
> >> +#include <asm/kvm_book3s.h>
> >> +#include <asm/kvm_ppc.h>
> >> +#include <asm/hvcall.h>
> >> +#include <asm/xics.h>
> >> +#include <asm/xive.h>
> >> +#include <asm/xive-regs.h>
> >> +#include <asm/debug.h>
> >> +#include <asm/debugfs.h>
> >> +#include <asm/time.h>
> >> +#include <asm/opal.h>
> >> +
> >> +#include <linux/debugfs.h>
> >> +#include <linux/seq_file.h>
> >> +
> >> +#include "book3s_xive.h"
> >> +
> >> +static void xive_native_cleanup_queue(struct kvm_vcpu *vcpu, int prio)
> >> +{
> >> +	struct kvmppc_xive_vcpu *xc = vcpu->arch.xive_vcpu;
> >> +	struct xive_q *q = &xc->queues[prio];
> >> +
> >> +	xive_native_disable_queue(xc->vp_id, q, prio);
> >> +	if (q->qpage) {
> >> +		put_page(virt_to_page(q->qpage));
> >> +		q->qpage = NULL;
> >> +	}
> >> +}
> >> +
> >> +void kvmppc_xive_native_cleanup_vcpu(struct kvm_vcpu *vcpu)
> >> +{
> >> +	struct kvmppc_xive_vcpu *xc = vcpu->arch.xive_vcpu;
> >> +	int i;
> >> +
> >> +	if (!kvmppc_xive_enabled(vcpu))
> >> +		return;
> >> +
> >> +	if (!xc)
> >> +		return;
> >> +
> >> +	pr_devel("native_cleanup_vcpu(cpu=%d)\n", xc->server_num);
> >> +
> >> +	/* Ensure no interrupt is still routed to that VP */
> >> +	xc->valid = false;
> >> +	kvmppc_xive_disable_vcpu_interrupts(vcpu);
> >> +
> >> +	/* Disable the VP */
> >> +	xive_native_disable_vp(xc->vp_id);
> >> +
> >> +	/* Free the queues & associated interrupts */
> >> +	for (i = 0; i < KVMPPC_XIVE_Q_COUNT; i++) {
> >> +		/* Free the escalation irq */
> >> +		if (xc->esc_virq[i]) {
> >> +			free_irq(xc->esc_virq[i], vcpu);
> >> +			irq_dispose_mapping(xc->esc_virq[i]);
> >> +			kfree(xc->esc_virq_names[i]);
> >> +			xc->esc_virq[i] = 0;
> >> +		}
> >> +
> >> +		/* Free the queue */
> >> +		xive_native_cleanup_queue(vcpu, i);
> >> +	}
> >> +
> >> +	/* Free the VP */
> >> +	kfree(xc);
> >> +
> >> +	/* Cleanup the vcpu */
> >> +	vcpu->arch.irq_type = KVMPPC_IRQ_DEFAULT;
> >> +	vcpu->arch.xive_vcpu = NULL;
> >> +}
> >> +
> >> +int kvmppc_xive_native_connect_vcpu(struct kvm_device *dev,
> >> +				    struct kvm_vcpu *vcpu, u32 cpu)
> > 
> > Why do we need both a *vcpu and a cpu number as an integer?
> 
> To be in sync with the other similar routines : kvmppc_xics_connect_vcpu() 
> and kvmppc_xive_connect_vcpu().
> 
> But if we consider that this 'cpu' parameter is always in sync with 
> vcpu->vcpu_id, we could remove it from the KVM ioctl call I suppose.
> 
> Should we do the same for the other routines ? 

Well.. I don't know why they are that way.  Is that int parameter the
XICS server number, which need not be the same as the vcpu_id ?  Can
we set that arbitrarily in XIVE as well?

It looks like these parameters need a name change at least to make it
clearer what the distinction is.

> >> +{
> >> +	struct kvmppc_xive *xive = dev->private;
> >> +	struct kvmppc_xive_vcpu *xc;
> >> +	int rc;
> >> +
> >> +	pr_devel("native_connect_vcpu(cpu=%d)\n", cpu);
> >> +
> >> +	if (dev->ops != &kvm_xive_native_ops) {
> >> +		pr_devel("Wrong ops !\n");
> >> +		return -EPERM;
> >> +	}
> >> +	if (xive->kvm != vcpu->kvm)
> >> +		return -EPERM;
> >> +	if (vcpu->arch.irq_type)
> > 
> > Please use an explicit == / != here so we don't have to remember which
> > symbolic value corresponds to 0.
> 
> ok. I agree.
> 
> Thanks,
> 
> C. 
> 
> 
> > 
> >> +		return -EBUSY;
> >> +	if (kvmppc_xive_find_server(vcpu->kvm, cpu)) {
> >> +		pr_devel("Duplicate !\n");
> >> +		return -EEXIST;
> >> +	}
> >> +	if (cpu >= KVM_MAX_VCPUS) {
> >> +		pr_devel("Out of bounds !\n");
> >> +		return -EINVAL;
> >> +	}
> >> +	xc = kzalloc(sizeof(*xc), GFP_KERNEL);
> >> +	if (!xc)
> >> +		return -ENOMEM;
> >> +
> >> +	mutex_lock(&vcpu->kvm->lock);
> >> +	vcpu->arch.xive_vcpu = xc;
> >> +	xc->xive = xive;
> >> +	xc->vcpu = vcpu;
> >> +	xc->server_num = cpu;
> >> +	xc->vp_id = xive->vp_base + cpu;
> >> +	xc->valid = true;
> >> +
> >> +	rc = xive_native_get_vp_info(xc->vp_id, &xc->vp_cam, &xc->vp_chip_id);
> >> +	if (rc) {
> >> +		pr_err("Failed to get VP info from OPAL: %d\n", rc);
> >> +		goto bail;
> >> +	}
> >> +
> >> +	/*
> >> +	 * Enable the VP first as the single escalation mode will
> >> +	 * affect escalation interrupts numbering
> >> +	 */
> >> +	rc = xive_native_enable_vp(xc->vp_id, xive->single_escalation);
> >> +	if (rc) {
> >> +		pr_err("Failed to enable VP in OPAL: %d\n", rc);
> >> +		goto bail;
> >> +	}
> >> +
> >> +	/* Configure VCPU fields for use by assembly push/pull */
> >> +	vcpu->arch.xive_saved_state.w01 = cpu_to_be64(0xff000000);
> >> +	vcpu->arch.xive_cam_word = cpu_to_be32(xc->vp_cam | TM_QW1W2_VO);
> >> +
> >> +	/* TODO: initialize queues ? */
> >> +
> >> +bail:
> >> +	vcpu->arch.irq_type = KVMPPC_IRQ_XIVE;
> >> +	mutex_unlock(&vcpu->kvm->lock);
> >> +	if (rc)
> >> +		kvmppc_xive_native_cleanup_vcpu(vcpu);
> >> +
> >> +	return rc;
> >> +}
> >> +
> >> +static int kvmppc_xive_native_set_attr(struct kvm_device *dev,
> >> +				       struct kvm_device_attr *attr)
> >> +{
> >> +	return -ENXIO;
> >> +}
> >> +
> >> +static int kvmppc_xive_native_get_attr(struct kvm_device *dev,
> >> +				       struct kvm_device_attr *attr)
> >> +{
> >> +	return -ENXIO;
> >> +}
> >> +
> >> +static int kvmppc_xive_native_has_attr(struct kvm_device *dev,
> >> +				       struct kvm_device_attr *attr)
> >> +{
> >> +	return -ENXIO;
> >> +}
> >> +
> >> +static void kvmppc_xive_native_free(struct kvm_device *dev)
> >> +{
> >> +	struct kvmppc_xive *xive = dev->private;
> >> +	struct kvm *kvm = xive->kvm;
> >> +	int i;
> >> +
> >> +	debugfs_remove(xive->dentry);
> >> +
> >> +	pr_devel("Destroying xive native for partition\n");
> >> +
> >> +	if (kvm)
> >> +		kvm->arch.xive = NULL;
> >> +
> >> +	/* Mask and free interrupts */
> >> +	for (i = 0; i <= xive->max_sbid; i++) {
> >> +		if (xive->src_blocks[i])
> >> +			kvmppc_xive_free_sources(xive->src_blocks[i]);
> >> +		kfree(xive->src_blocks[i]);
> >> +		xive->src_blocks[i] = NULL;
> >> +	}
> >> +
> >> +	if (xive->vp_base != XIVE_INVALID_VP)
> >> +		xive_native_free_vp_block(xive->vp_base);
> >> +
> >> +	kfree(xive);
> >> +	kfree(dev);
> >> +}
> >> +
> >> +static int kvmppc_xive_native_create(struct kvm_device *dev, u32 type)
> >> +{
> >> +	struct kvmppc_xive *xive;
> >> +	struct kvm *kvm = dev->kvm;
> >> +	int ret = 0;
> >> +
> >> +	pr_devel("Creating xive native for partition\n");
> >> +
> >> +	if (kvm->arch.xive)
> >> +		return -EEXIST;
> >> +
> >> +	xive = kzalloc(sizeof(*xive), GFP_KERNEL);
> >> +	if (!xive)
> >> +		return -ENOMEM;
> >> +
> >> +	dev->private = xive;
> >> +	xive->dev = dev;
> >> +	xive->kvm = kvm;
> >> +	kvm->arch.xive = xive;
> >> +
> >> +	/* We use the default queue size set by the host */
> >> +	xive->q_order = xive_native_default_eq_shift();
> >> +	if (xive->q_order < PAGE_SHIFT)
> >> +		xive->q_page_order = 0;
> >> +	else
> >> +		xive->q_page_order = xive->q_order - PAGE_SHIFT;
> >> +
> >> +	/* Allocate a bunch of VPs */
> >> +	xive->vp_base = xive_native_alloc_vp_block(KVM_MAX_VCPUS);
> >> +	pr_devel("VP_Base=%x\n", xive->vp_base);
> >> +
> >> +	if (xive->vp_base == XIVE_INVALID_VP)
> >> +		ret = -ENOMEM;
> >> +
> >> +	xive->single_escalation = xive_native_has_single_escalation();
> >> +
> >> +	if (ret)
> >> +		kfree(xive);
> >> +
> >> +	return ret;
> >> +}
> >> +
> >> +static int xive_native_debug_show(struct seq_file *m, void *private)
> >> +{
> >> +	struct kvmppc_xive *xive = m->private;
> >> +	struct kvm *kvm = xive->kvm;
> >> +	struct kvm_vcpu *vcpu;
> >> +	unsigned int i;
> >> +
> >> +	if (!kvm)
> >> +		return 0;
> >> +
> >> +	seq_puts(m, "=========\nVCPU state\n=========\n");
> >> +
> >> +	kvm_for_each_vcpu(i, vcpu, kvm) {
> >> +		struct kvmppc_xive_vcpu *xc = vcpu->arch.xive_vcpu;
> >> +
> >> +		if (!xc)
> >> +			continue;
> >> +
> >> +		seq_printf(m, "cpu server %#x NSR=%02x CPPR=%02x IBP=%02x PIPR=%02x w01=%016llx w2=%08x\n",
> >> +			   xc->server_num,
> >> +			   vcpu->arch.xive_saved_state.nsr,
> >> +			   vcpu->arch.xive_saved_state.cppr,
> >> +			   vcpu->arch.xive_saved_state.ipb,
> >> +			   vcpu->arch.xive_saved_state.pipr,
> >> +			   vcpu->arch.xive_saved_state.w01,
> >> +			   (u32) vcpu->arch.xive_cam_word);
> >> +
> >> +		kvmppc_xive_debug_show_queues(m, vcpu);
> >> +	}
> >> +
> >> +	return 0;
> >> +}
> >> +
> >> +static int xive_native_debug_open(struct inode *inode, struct file *file)
> >> +{
> >> +	return single_open(file, xive_native_debug_show, inode->i_private);
> >> +}
> >> +
> >> +static const struct file_operations xive_native_debug_fops = {
> >> +	.open = xive_native_debug_open,
> >> +	.read = seq_read,
> >> +	.llseek = seq_lseek,
> >> +	.release = single_release,
> >> +};
> >> +
> >> +static void xive_native_debugfs_init(struct kvmppc_xive *xive)
> >> +{
> >> +	char *name;
> >> +
> >> +	name = kasprintf(GFP_KERNEL, "kvm-xive-%p", xive);
> >> +	if (!name) {
> >> +		pr_err("%s: no memory for name\n", __func__);
> >> +		return;
> >> +	}
> >> +
> >> +	xive->dentry = debugfs_create_file(name, 0444, powerpc_debugfs_root,
> >> +					   xive, &xive_native_debug_fops);
> >> +
> >> +	pr_debug("%s: created %s\n", __func__, name);
> >> +	kfree(name);
> >> +}
> >> +
> >> +static void kvmppc_xive_native_init(struct kvm_device *dev)
> >> +{
> >> +	struct kvmppc_xive *xive = (struct kvmppc_xive *)dev->private;
> >> +
> >> +	/* Register some debug interfaces */
> >> +	xive_native_debugfs_init(xive);
> >> +}
> >> +
> >> +struct kvm_device_ops kvm_xive_native_ops = {
> >> +	.name = "kvm-xive-native",
> >> +	.create = kvmppc_xive_native_create,
> >> +	.init = kvmppc_xive_native_init,
> >> +	.destroy = kvmppc_xive_native_free,
> >> +	.set_attr = kvmppc_xive_native_set_attr,
> >> +	.get_attr = kvmppc_xive_native_get_attr,
> >> +	.has_attr = kvmppc_xive_native_has_attr,
> >> +};
> >> +
> >> +void kvmppc_xive_native_init_module(void)
> >> +{
> >> +	;
> >> +}
> >> +
> >> +void kvmppc_xive_native_exit_module(void)
> >> +{
> >> +	;
> >> +}
> >> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
> >> index b90a7d154180..01d526e15e9d 100644
> >> --- a/arch/powerpc/kvm/powerpc.c
> >> +++ b/arch/powerpc/kvm/powerpc.c
> >> @@ -566,6 +566,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
> >>  	case KVM_CAP_PPC_ENABLE_HCALL:
> >>  #ifdef CONFIG_KVM_XICS
> >>  	case KVM_CAP_IRQ_XICS:
> >> +#endif
> >> +#ifdef CONFIG_KVM_XIVE
> >> +	case KVM_CAP_PPC_IRQ_XIVE:
> >>  #endif
> >>  	case KVM_CAP_PPC_GET_CPU_CHAR:
> >>  		r = 1;
> >> @@ -753,6 +756,9 @@ void kvm_arch_vcpu_free(struct kvm_vcpu *vcpu)
> >>  		else
> >>  			kvmppc_xics_free_icp(vcpu);
> >>  		break;
> >> +	case KVMPPC_IRQ_XIVE:
> >> +		kvmppc_xive_native_cleanup_vcpu(vcpu);
> >> +		break;
> >>  	}
> >>  
> >>  	kvmppc_core_vcpu_free(vcpu);
> >> @@ -1941,6 +1947,30 @@ static int kvm_vcpu_ioctl_enable_cap(struct kvm_vcpu *vcpu,
> >>  		break;
> >>  	}
> >>  #endif /* CONFIG_KVM_XICS */
> >> +#ifdef CONFIG_KVM_XIVE
> >> +	case KVM_CAP_PPC_IRQ_XIVE: {
> >> +		struct fd f;
> >> +		struct kvm_device *dev;
> >> +
> >> +		r = -EBADF;
> >> +		f = fdget(cap->args[0]);
> >> +		if (!f.file)
> >> +			break;
> >> +
> >> +		r = -ENXIO;
> >> +		if (!xive_enabled())
> >> +			break;
> >> +
> >> +		r = -EPERM;
> >> +		dev = kvm_device_from_filp(f.file);
> >> +		if (dev)
> >> +			r = kvmppc_xive_native_connect_vcpu(dev, vcpu,
> >> +							    cap->args[1]);
> >> +
> >> +		fdput(f);
> >> +		break;
> >> +	}
> >> +#endif /* CONFIG_KVM_XIVE */
> >>  #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
> >>  	case KVM_CAP_PPC_FWNMI:
> >>  		r = -EINVAL;
> >> diff --git a/arch/powerpc/kvm/Makefile b/arch/powerpc/kvm/Makefile
> >> index 64f1135e7732..806cbe488410 100644
> >> --- a/arch/powerpc/kvm/Makefile
> >> +++ b/arch/powerpc/kvm/Makefile
> >> @@ -99,7 +99,7 @@ endif
> >>  kvm-book3s_64-objs-$(CONFIG_KVM_XICS) += \
> >>  	book3s_xics.o
> >>  
> >> -kvm-book3s_64-objs-$(CONFIG_KVM_XIVE) += book3s_xive.o
> >> +kvm-book3s_64-objs-$(CONFIG_KVM_XIVE) += book3s_xive.o book3s_xive_native.o
> >>  kvm-book3s_64-objs-$(CONFIG_SPAPR_TCE_IOMMU) += book3s_64_vio.o
> >>  
> >>  kvm-book3s_64-module-objs := \
> > 
>
diff mbox series

Patch

diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index 0f98f00da2ea..c522e8274ad9 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -220,6 +220,7 @@  extern struct kvm_device_ops kvm_xics_ops;
 struct kvmppc_xive;
 struct kvmppc_xive_vcpu;
 extern struct kvm_device_ops kvm_xive_ops;
+extern struct kvm_device_ops kvm_xive_native_ops;
 
 struct kvmppc_passthru_irqmap;
 
@@ -446,6 +447,7 @@  struct kvmppc_passthru_irqmap {
 #define KVMPPC_IRQ_DEFAULT	0
 #define KVMPPC_IRQ_MPIC		1
 #define KVMPPC_IRQ_XICS		2 /* Includes a XIVE option */
+#define KVMPPC_IRQ_XIVE		3 /* XIVE native exploitation mode */
 
 #define MMIO_HPTE_CACHE_SIZE	4
 
diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index eb0d79f0ca45..1bb313f238fe 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -591,6 +591,18 @@  extern int kvmppc_xive_set_icp(struct kvm_vcpu *vcpu, u64 icpval);
 extern int kvmppc_xive_set_irq(struct kvm *kvm, int irq_source_id, u32 irq,
 			       int level, bool line_status);
 extern void kvmppc_xive_push_vcpu(struct kvm_vcpu *vcpu);
+
+static inline int kvmppc_xive_enabled(struct kvm_vcpu *vcpu)
+{
+	return vcpu->arch.irq_type == KVMPPC_IRQ_XIVE;
+}
+
+extern int kvmppc_xive_native_connect_vcpu(struct kvm_device *dev,
+				    struct kvm_vcpu *vcpu, u32 cpu);
+extern void kvmppc_xive_native_cleanup_vcpu(struct kvm_vcpu *vcpu);
+extern void kvmppc_xive_native_init_module(void);
+extern void kvmppc_xive_native_exit_module(void);
+
 #else
 static inline int kvmppc_xive_set_xive(struct kvm *kvm, u32 irq, u32 server,
 				       u32 priority) { return -1; }
@@ -614,6 +626,15 @@  static inline int kvmppc_xive_set_icp(struct kvm_vcpu *vcpu, u64 icpval) { retur
 static inline int kvmppc_xive_set_irq(struct kvm *kvm, int irq_source_id, u32 irq,
 				      int level, bool line_status) { return -ENODEV; }
 static inline void kvmppc_xive_push_vcpu(struct kvm_vcpu *vcpu) { }
+
+static inline int kvmppc_xive_enabled(struct kvm_vcpu *vcpu)
+	{ return 0; }
+static inline int kvmppc_xive_native_connect_vcpu(struct kvm_device *dev,
+						  struct kvm_vcpu *vcpu, u32 cpu) { return -EBUSY; }
+static inline void kvmppc_xive_native_cleanup_vcpu(struct kvm_vcpu *vcpu) { }
+static inline void kvmppc_xive_native_init_module(void) { }
+static inline void kvmppc_xive_native_exit_module(void) { }
+
 #endif /* CONFIG_KVM_XIVE */
 
 /*
diff --git a/arch/powerpc/kvm/book3s_xive.h b/arch/powerpc/kvm/book3s_xive.h
index 10c4aa5cd010..5f22415520b4 100644
--- a/arch/powerpc/kvm/book3s_xive.h
+++ b/arch/powerpc/kvm/book3s_xive.h
@@ -12,6 +12,9 @@ 
 #ifdef CONFIG_KVM_XICS
 #include "book3s_xics.h"
 
+#define KVMPPC_XIVE_FIRST_IRQ	0
+#define KVMPPC_XIVE_NR_IRQS	KVMPPC_XICS_NR_IRQS
+
 /*
  * State for one guest irq source.
  *
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 6d4ea4b6c922..52bf74a1616e 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -988,6 +988,7 @@  struct kvm_ppc_resize_hpt {
 #define KVM_CAP_ARM_VM_IPA_SIZE 165
 #define KVM_CAP_MANUAL_DIRTY_LOG_PROTECT 166
 #define KVM_CAP_HYPERV_CPUID 167
+#define KVM_CAP_PPC_IRQ_XIVE 168
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
@@ -1211,6 +1212,8 @@  enum kvm_device_type {
 #define KVM_DEV_TYPE_ARM_VGIC_V3	KVM_DEV_TYPE_ARM_VGIC_V3
 	KVM_DEV_TYPE_ARM_VGIC_ITS,
 #define KVM_DEV_TYPE_ARM_VGIC_ITS	KVM_DEV_TYPE_ARM_VGIC_ITS
+	KVM_DEV_TYPE_XIVE,
+#define KVM_DEV_TYPE_XIVE		KVM_DEV_TYPE_XIVE
 	KVM_DEV_TYPE_MAX,
 };
 
diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
index bd1a677dd9e4..de7eed191107 100644
--- a/arch/powerpc/kvm/book3s.c
+++ b/arch/powerpc/kvm/book3s.c
@@ -1039,7 +1039,10 @@  static int kvmppc_book3s_init(void)
 #ifdef CONFIG_KVM_XIVE
 	if (xive_enabled()) {
 		kvmppc_xive_init_module();
+		kvmppc_xive_native_init_module();
 		kvm_register_device_ops(&kvm_xive_ops, KVM_DEV_TYPE_XICS);
+		kvm_register_device_ops(&kvm_xive_native_ops,
+					KVM_DEV_TYPE_XIVE);
 	} else
 #endif
 		kvm_register_device_ops(&kvm_xics_ops, KVM_DEV_TYPE_XICS);
@@ -1050,8 +1053,10 @@  static int kvmppc_book3s_init(void)
 static void kvmppc_book3s_exit(void)
 {
 #ifdef CONFIG_KVM_XICS
-	if (xive_enabled())
+	if (xive_enabled()) {
 		kvmppc_xive_exit_module();
+		kvmppc_xive_native_exit_module();
+	}
 #endif
 #ifdef CONFIG_KVM_BOOK3S_32_HANDLER
 	kvmppc_book3s_exit_pr();
diff --git a/arch/powerpc/kvm/book3s_xive_native.c b/arch/powerpc/kvm/book3s_xive_native.c
new file mode 100644
index 000000000000..115143e76c45
--- /dev/null
+++ b/arch/powerpc/kvm/book3s_xive_native.c
@@ -0,0 +1,332 @@ 
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2017-2019, IBM Corporation.
+ */
+
+#define pr_fmt(fmt) "xive-kvm: " fmt
+
+#include <linux/anon_inodes.h>
+#include <linux/kernel.h>
+#include <linux/kvm_host.h>
+#include <linux/err.h>
+#include <linux/gfp.h>
+#include <linux/spinlock.h>
+#include <linux/delay.h>
+#include <linux/percpu.h>
+#include <linux/cpumask.h>
+#include <asm/uaccess.h>
+#include <asm/kvm_book3s.h>
+#include <asm/kvm_ppc.h>
+#include <asm/hvcall.h>
+#include <asm/xics.h>
+#include <asm/xive.h>
+#include <asm/xive-regs.h>
+#include <asm/debug.h>
+#include <asm/debugfs.h>
+#include <asm/time.h>
+#include <asm/opal.h>
+
+#include <linux/debugfs.h>
+#include <linux/seq_file.h>
+
+#include "book3s_xive.h"
+
+static void xive_native_cleanup_queue(struct kvm_vcpu *vcpu, int prio)
+{
+	struct kvmppc_xive_vcpu *xc = vcpu->arch.xive_vcpu;
+	struct xive_q *q = &xc->queues[prio];
+
+	xive_native_disable_queue(xc->vp_id, q, prio);
+	if (q->qpage) {
+		put_page(virt_to_page(q->qpage));
+		q->qpage = NULL;
+	}
+}
+
+void kvmppc_xive_native_cleanup_vcpu(struct kvm_vcpu *vcpu)
+{
+	struct kvmppc_xive_vcpu *xc = vcpu->arch.xive_vcpu;
+	int i;
+
+	if (!kvmppc_xive_enabled(vcpu))
+		return;
+
+	if (!xc)
+		return;
+
+	pr_devel("native_cleanup_vcpu(cpu=%d)\n", xc->server_num);
+
+	/* Ensure no interrupt is still routed to that VP */
+	xc->valid = false;
+	kvmppc_xive_disable_vcpu_interrupts(vcpu);
+
+	/* Disable the VP */
+	xive_native_disable_vp(xc->vp_id);
+
+	/* Free the queues & associated interrupts */
+	for (i = 0; i < KVMPPC_XIVE_Q_COUNT; i++) {
+		/* Free the escalation irq */
+		if (xc->esc_virq[i]) {
+			free_irq(xc->esc_virq[i], vcpu);
+			irq_dispose_mapping(xc->esc_virq[i]);
+			kfree(xc->esc_virq_names[i]);
+			xc->esc_virq[i] = 0;
+		}
+
+		/* Free the queue */
+		xive_native_cleanup_queue(vcpu, i);
+	}
+
+	/* Free the VP */
+	kfree(xc);
+
+	/* Cleanup the vcpu */
+	vcpu->arch.irq_type = KVMPPC_IRQ_DEFAULT;
+	vcpu->arch.xive_vcpu = NULL;
+}
+
+int kvmppc_xive_native_connect_vcpu(struct kvm_device *dev,
+				    struct kvm_vcpu *vcpu, u32 cpu)
+{
+	struct kvmppc_xive *xive = dev->private;
+	struct kvmppc_xive_vcpu *xc;
+	int rc;
+
+	pr_devel("native_connect_vcpu(cpu=%d)\n", cpu);
+
+	if (dev->ops != &kvm_xive_native_ops) {
+		pr_devel("Wrong ops !\n");
+		return -EPERM;
+	}
+	if (xive->kvm != vcpu->kvm)
+		return -EPERM;
+	if (vcpu->arch.irq_type)
+		return -EBUSY;
+	if (kvmppc_xive_find_server(vcpu->kvm, cpu)) {
+		pr_devel("Duplicate !\n");
+		return -EEXIST;
+	}
+	if (cpu >= KVM_MAX_VCPUS) {
+		pr_devel("Out of bounds !\n");
+		return -EINVAL;
+	}
+	xc = kzalloc(sizeof(*xc), GFP_KERNEL);
+	if (!xc)
+		return -ENOMEM;
+
+	mutex_lock(&vcpu->kvm->lock);
+	vcpu->arch.xive_vcpu = xc;
+	xc->xive = xive;
+	xc->vcpu = vcpu;
+	xc->server_num = cpu;
+	xc->vp_id = xive->vp_base + cpu;
+	xc->valid = true;
+
+	rc = xive_native_get_vp_info(xc->vp_id, &xc->vp_cam, &xc->vp_chip_id);
+	if (rc) {
+		pr_err("Failed to get VP info from OPAL: %d\n", rc);
+		goto bail;
+	}
+
+	/*
+	 * Enable the VP first as the single escalation mode will
+	 * affect escalation interrupts numbering
+	 */
+	rc = xive_native_enable_vp(xc->vp_id, xive->single_escalation);
+	if (rc) {
+		pr_err("Failed to enable VP in OPAL: %d\n", rc);
+		goto bail;
+	}
+
+	/* Configure VCPU fields for use by assembly push/pull */
+	vcpu->arch.xive_saved_state.w01 = cpu_to_be64(0xff000000);
+	vcpu->arch.xive_cam_word = cpu_to_be32(xc->vp_cam | TM_QW1W2_VO);
+
+	/* TODO: initialize queues ? */
+
+bail:
+	vcpu->arch.irq_type = KVMPPC_IRQ_XIVE;
+	mutex_unlock(&vcpu->kvm->lock);
+	if (rc)
+		kvmppc_xive_native_cleanup_vcpu(vcpu);
+
+	return rc;
+}
+
+static int kvmppc_xive_native_set_attr(struct kvm_device *dev,
+				       struct kvm_device_attr *attr)
+{
+	return -ENXIO;
+}
+
+static int kvmppc_xive_native_get_attr(struct kvm_device *dev,
+				       struct kvm_device_attr *attr)
+{
+	return -ENXIO;
+}
+
+static int kvmppc_xive_native_has_attr(struct kvm_device *dev,
+				       struct kvm_device_attr *attr)
+{
+	return -ENXIO;
+}
+
+static void kvmppc_xive_native_free(struct kvm_device *dev)
+{
+	struct kvmppc_xive *xive = dev->private;
+	struct kvm *kvm = xive->kvm;
+	int i;
+
+	debugfs_remove(xive->dentry);
+
+	pr_devel("Destroying xive native for partition\n");
+
+	if (kvm)
+		kvm->arch.xive = NULL;
+
+	/* Mask and free interrupts */
+	for (i = 0; i <= xive->max_sbid; i++) {
+		if (xive->src_blocks[i])
+			kvmppc_xive_free_sources(xive->src_blocks[i]);
+		kfree(xive->src_blocks[i]);
+		xive->src_blocks[i] = NULL;
+	}
+
+	if (xive->vp_base != XIVE_INVALID_VP)
+		xive_native_free_vp_block(xive->vp_base);
+
+	kfree(xive);
+	kfree(dev);
+}
+
+static int kvmppc_xive_native_create(struct kvm_device *dev, u32 type)
+{
+	struct kvmppc_xive *xive;
+	struct kvm *kvm = dev->kvm;
+	int ret = 0;
+
+	pr_devel("Creating xive native for partition\n");
+
+	if (kvm->arch.xive)
+		return -EEXIST;
+
+	xive = kzalloc(sizeof(*xive), GFP_KERNEL);
+	if (!xive)
+		return -ENOMEM;
+
+	dev->private = xive;
+	xive->dev = dev;
+	xive->kvm = kvm;
+	kvm->arch.xive = xive;
+
+	/* We use the default queue size set by the host */
+	xive->q_order = xive_native_default_eq_shift();
+	if (xive->q_order < PAGE_SHIFT)
+		xive->q_page_order = 0;
+	else
+		xive->q_page_order = xive->q_order - PAGE_SHIFT;
+
+	/* Allocate a bunch of VPs */
+	xive->vp_base = xive_native_alloc_vp_block(KVM_MAX_VCPUS);
+	pr_devel("VP_Base=%x\n", xive->vp_base);
+
+	if (xive->vp_base == XIVE_INVALID_VP)
+		ret = -ENOMEM;
+
+	xive->single_escalation = xive_native_has_single_escalation();
+
+	if (ret)
+		kfree(xive);
+
+	return ret;
+}
+
+static int xive_native_debug_show(struct seq_file *m, void *private)
+{
+	struct kvmppc_xive *xive = m->private;
+	struct kvm *kvm = xive->kvm;
+	struct kvm_vcpu *vcpu;
+	unsigned int i;
+
+	if (!kvm)
+		return 0;
+
+	seq_puts(m, "=========\nVCPU state\n=========\n");
+
+	kvm_for_each_vcpu(i, vcpu, kvm) {
+		struct kvmppc_xive_vcpu *xc = vcpu->arch.xive_vcpu;
+
+		if (!xc)
+			continue;
+
+		seq_printf(m, "cpu server %#x NSR=%02x CPPR=%02x IBP=%02x PIPR=%02x w01=%016llx w2=%08x\n",
+			   xc->server_num,
+			   vcpu->arch.xive_saved_state.nsr,
+			   vcpu->arch.xive_saved_state.cppr,
+			   vcpu->arch.xive_saved_state.ipb,
+			   vcpu->arch.xive_saved_state.pipr,
+			   vcpu->arch.xive_saved_state.w01,
+			   (u32) vcpu->arch.xive_cam_word);
+
+		kvmppc_xive_debug_show_queues(m, vcpu);
+	}
+
+	return 0;
+}
+
+static int xive_native_debug_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, xive_native_debug_show, inode->i_private);
+}
+
+static const struct file_operations xive_native_debug_fops = {
+	.open = xive_native_debug_open,
+	.read = seq_read,
+	.llseek = seq_lseek,
+	.release = single_release,
+};
+
+static void xive_native_debugfs_init(struct kvmppc_xive *xive)
+{
+	char *name;
+
+	name = kasprintf(GFP_KERNEL, "kvm-xive-%p", xive);
+	if (!name) {
+		pr_err("%s: no memory for name\n", __func__);
+		return;
+	}
+
+	xive->dentry = debugfs_create_file(name, 0444, powerpc_debugfs_root,
+					   xive, &xive_native_debug_fops);
+
+	pr_debug("%s: created %s\n", __func__, name);
+	kfree(name);
+}
+
+static void kvmppc_xive_native_init(struct kvm_device *dev)
+{
+	struct kvmppc_xive *xive = (struct kvmppc_xive *)dev->private;
+
+	/* Register some debug interfaces */
+	xive_native_debugfs_init(xive);
+}
+
+struct kvm_device_ops kvm_xive_native_ops = {
+	.name = "kvm-xive-native",
+	.create = kvmppc_xive_native_create,
+	.init = kvmppc_xive_native_init,
+	.destroy = kvmppc_xive_native_free,
+	.set_attr = kvmppc_xive_native_set_attr,
+	.get_attr = kvmppc_xive_native_get_attr,
+	.has_attr = kvmppc_xive_native_has_attr,
+};
+
+void kvmppc_xive_native_init_module(void)
+{
+	;
+}
+
+void kvmppc_xive_native_exit_module(void)
+{
+	;
+}
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index b90a7d154180..01d526e15e9d 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -566,6 +566,9 @@  int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 	case KVM_CAP_PPC_ENABLE_HCALL:
 #ifdef CONFIG_KVM_XICS
 	case KVM_CAP_IRQ_XICS:
+#endif
+#ifdef CONFIG_KVM_XIVE
+	case KVM_CAP_PPC_IRQ_XIVE:
 #endif
 	case KVM_CAP_PPC_GET_CPU_CHAR:
 		r = 1;
@@ -753,6 +756,9 @@  void kvm_arch_vcpu_free(struct kvm_vcpu *vcpu)
 		else
 			kvmppc_xics_free_icp(vcpu);
 		break;
+	case KVMPPC_IRQ_XIVE:
+		kvmppc_xive_native_cleanup_vcpu(vcpu);
+		break;
 	}
 
 	kvmppc_core_vcpu_free(vcpu);
@@ -1941,6 +1947,30 @@  static int kvm_vcpu_ioctl_enable_cap(struct kvm_vcpu *vcpu,
 		break;
 	}
 #endif /* CONFIG_KVM_XICS */
+#ifdef CONFIG_KVM_XIVE
+	case KVM_CAP_PPC_IRQ_XIVE: {
+		struct fd f;
+		struct kvm_device *dev;
+
+		r = -EBADF;
+		f = fdget(cap->args[0]);
+		if (!f.file)
+			break;
+
+		r = -ENXIO;
+		if (!xive_enabled())
+			break;
+
+		r = -EPERM;
+		dev = kvm_device_from_filp(f.file);
+		if (dev)
+			r = kvmppc_xive_native_connect_vcpu(dev, vcpu,
+							    cap->args[1]);
+
+		fdput(f);
+		break;
+	}
+#endif /* CONFIG_KVM_XIVE */
 #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
 	case KVM_CAP_PPC_FWNMI:
 		r = -EINVAL;
diff --git a/arch/powerpc/kvm/Makefile b/arch/powerpc/kvm/Makefile
index 64f1135e7732..806cbe488410 100644
--- a/arch/powerpc/kvm/Makefile
+++ b/arch/powerpc/kvm/Makefile
@@ -99,7 +99,7 @@  endif
 kvm-book3s_64-objs-$(CONFIG_KVM_XICS) += \
 	book3s_xics.o
 
-kvm-book3s_64-objs-$(CONFIG_KVM_XIVE) += book3s_xive.o
+kvm-book3s_64-objs-$(CONFIG_KVM_XIVE) += book3s_xive.o book3s_xive_native.o
 kvm-book3s_64-objs-$(CONFIG_SPAPR_TCE_IOMMU) += book3s_64_vio.o
 
 kvm-book3s_64-module-objs := \