[v2,1/3] q35: set split kernel irqchip as default

Message ID	20181220054037.24320-2-peterx@redhat.com
State	New
Headers	show Return-Path: <qemu-devel-bounces+incoming=patchwork.ozlabs.org@nongnu.org> From: Peter Xu <peterx@redhat.com> To: qemu-devel@nongnu.org Date: Thu, 20 Dec 2018 13:40:35 +0800 Message-Id: <20181220054037.24320-2-peterx@redhat.com> In-Reply-To: <20181220054037.24320-1-peterx@redhat.com> References: <20181220054037.24320-1-peterx@redhat.com> Subject: [Qemu-devel] [PATCH v2 1/3] q35: set split kernel irqchip as default Precedence: list Cc: Igor Mammedov <imammedo@redhat.com>, Paolo Bonzini <pbonzini@redhat.com>, Eduardo Habkost <ehabkost@redhat.com>, peterx@redhat.com, "Michael S . Tsirkin" <mst@redhat.com> Errors-To: qemu-devel-bounces+incoming=patchwork.ozlabs.org@nongnu.org Sender: "Qemu-devel" <qemu-devel-bounces+incoming=patchwork.ozlabs.org@nongnu.org>
Series	q35: change defaults for kernel irqchip and IR \| expand [v2,0/3] q35: change defaults for kernel irqchip and IR [v2,1/3] q35: set split kernel irqchip as default [v2,2/3] x86-iommu: switch intr_supported to OnOffAuto type [v2,3/3] x86-iommu: turn on IR by default if proper

Peter Xu Dec. 20, 2018, 5:40 a.m. UTC

Starting from QEMU 4.0, let's specify "split" as the default value for
kernel-irqchip.

So for QEMU>=4.0 we'll have: allowed=Y,required=N,split=Y
   for QEMU<=3.1 we'll have: allowed=Y,required=N,split=N
   (omitting all the "kernel_irqchip_" prefix)

Note that this will let the default q35 machine type to depend on
Linux version 4.4 or newer because that's where split irqchip is
introduced in kernel.  But it's fine since we're boosting supported
Linux version for QEMU 4.0 to around Linux 4.5.  For more information
please refer to the discussion on AMD's RDTSCP:

  https://lore.kernel.org/lkml/20181210181328.GA762@zn.tnic/

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 hw/core/machine.c   | 2 ++
 hw/i386/pc_q35.c    | 2 ++
 include/hw/boards.h | 1 +
 3 files changed, 5 insertions(+)

Eduardo Habkost Dec. 20, 2018, 12:13 p.m. UTC | #1

On Thu, Dec 20, 2018 at 01:40:35PM +0800, Peter Xu wrote:
> Starting from QEMU 4.0, let's specify "split" as the default value for
> kernel-irqchip.
> 
> So for QEMU>=4.0 we'll have: allowed=Y,required=N,split=Y
>    for QEMU<=3.1 we'll have: allowed=Y,required=N,split=N
>    (omitting all the "kernel_irqchip_" prefix)
> 
> Note that this will let the default q35 machine type to depend on
> Linux version 4.4 or newer because that's where split irqchip is
> introduced in kernel.  But it's fine since we're boosting supported
> Linux version for QEMU 4.0 to around Linux 4.5.  For more information
> please refer to the discussion on AMD's RDTSCP:
> 
>   https://lore.kernel.org/lkml/20181210181328.GA762@zn.tnic/
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>

Reviewed-by: Eduardo Habkost <ehabkost@redhat.com>

Alex Williamson April 26, 2019, 7:27 p.m. UTC | #2

On Thu, 20 Dec 2018 13:40:35 +0800
Peter Xu <peterx@redhat.com> wrote:

> Starting from QEMU 4.0, let's specify "split" as the default value for
> kernel-irqchip.
> 
> So for QEMU>=4.0 we'll have: allowed=Y,required=N,split=Y
>    for QEMU<=3.1 we'll have: allowed=Y,required=N,split=N
>    (omitting all the "kernel_irqchip_" prefix)
> 
> Note that this will let the default q35 machine type to depend on
> Linux version 4.4 or newer because that's where split irqchip is
> introduced in kernel.  But it's fine since we're boosting supported
> Linux version for QEMU 4.0 to around Linux 4.5.  For more information
> please refer to the discussion on AMD's RDTSCP:
> 
>   https://lore.kernel.org/lkml/20181210181328.GA762@zn.tnic/

It looks like this broke INTx for vfio-pci, see:

https://bugs.launchpad.net/qemu/+bug/1826422

In my testing it looks like KVM advertises supporting the KVM_IRQFD
resample feature, but vfio never gets the unmask notification, so the
device remains with DisINTx set and no further interrupts are
generated.  Do we expect KVM's IRQFD with resampler to work in the
split IRQ mode?  We can certainly hope that "high performance" devices
use MSI or MSI/X, but this would be quite a performance regression with
split mode if our userspace bypass for INTx goes away.  Thanks,

Alex

PS - the least impact workaround for users is to re-enable kernel mode
irqchip with kernel_irqchip=on or <ioapic driver='kvm'/>.  We can also
force QEMU routing of INTx with the x-no-kvm-intx=on option per
vfio-pci device, but this disables the userspace bypass and brings
along higher interrupt latency and overhead.

Alex Williamson April 26, 2019, 9:02 p.m. UTC | #3

On Fri, 26 Apr 2019 13:27:44 -0600
Alex Williamson <alex.williamson@redhat.com> wrote:

> On Thu, 20 Dec 2018 13:40:35 +0800
> Peter Xu <peterx@redhat.com> wrote:
> 
> > Starting from QEMU 4.0, let's specify "split" as the default value for
> > kernel-irqchip.
> > 
> > So for QEMU>=4.0 we'll have: allowed=Y,required=N,split=Y
> >    for QEMU<=3.1 we'll have: allowed=Y,required=N,split=N
> >    (omitting all the "kernel_irqchip_" prefix)
> > 
> > Note that this will let the default q35 machine type to depend on
> > Linux version 4.4 or newer because that's where split irqchip is
> > introduced in kernel.  But it's fine since we're boosting supported
> > Linux version for QEMU 4.0 to around Linux 4.5.  For more information
> > please refer to the discussion on AMD's RDTSCP:
> > 
> >   https://lore.kernel.org/lkml/20181210181328.GA762@zn.tnic/  
> 
> It looks like this broke INTx for vfio-pci, see:
> 
> https://bugs.launchpad.net/qemu/+bug/1826422
> 
> In my testing it looks like KVM advertises supporting the KVM_IRQFD
> resample feature, but vfio never gets the unmask notification, so the
> device remains with DisINTx set and no further interrupts are
> generated.  Do we expect KVM's IRQFD with resampler to work in the
> split IRQ mode?  We can certainly hope that "high performance" devices
> use MSI or MSI/X, but this would be quite a performance regression with
> split mode if our userspace bypass for INTx goes away.  Thanks,

arch/x86/kvm/lapic.c:kvm_ioapic_send_eoi() dumps to userspace before
kvm_ioapic_update_eoi() can handle the irq_ack_notifier_list via
kvm_notify_acked_gsi(), so it looks like KVM really ought to return an
error when trying to register a resample IRQFD when irqchip_split(),
but do we have better options?  EOI handling in QEMU is pretty much
non-existent and even if it was in place, it's a big performance
regression for vfio INTx handling.  Is a split irqchip that retains
performant resampling (EOI) support possible?  Thanks,

Alex

Paolo Bonzini April 27, 2019, 5:29 a.m. UTC | #4

> > In my testing it looks like KVM advertises supporting the KVM_IRQFD
> > resample feature, but vfio never gets the unmask notification, so the
> > device remains with DisINTx set and no further interrupts are
> > generated.  Do we expect KVM's IRQFD with resampler to work in the
> > split IRQ mode?  We can certainly hope that "high performance" devices
> > use MSI or MSI/X, but this would be quite a performance regression with
> > split mode if our userspace bypass for INTx goes away.  Thanks,
> 
> arch/x86/kvm/lapic.c:kvm_ioapic_send_eoi() dumps to userspace before
> kvm_ioapic_update_eoi() can handle the irq_ack_notifier_list via
> kvm_notify_acked_gsi(),

That wouldn't help because kvm_ioapic_update_eoi would not even be
able to access vcpu->kvm->arch.vioapic (it's NULL).

The following untested patch would signal the resamplefd in kvm_ioapic_send_eoi,
before requesting the exit to userspace.  However I am not sure how QEMU
sets up the VFIO eventfds: if I understand correctly, when VFIO writes again to
the irq eventfd, the interrupt request would not reach the userspace IOAPIC, but
only the in-kernel LAPIC.  That would be incorrect, and if my understanding is
correct we need to trigger resampling from hw/intc/ioapic.c.

Something like:

1) VFIO uses kvm_irqchip_add_irqfd_notifier_gsi instead of KVM_IRQFD directly

2) add a NotifierList in kvm_irqchip_assign_irqfd

3) if kvm_irqchip_is_split(), register a corresponding Notifier in ioapic.c which
stores the resamplefd as an EventNotifier*

4) ioapic_eoi_broadcast writes to the resamplefd

BTW, how do non-x86 platforms handle intx resampling?

Paolo

---- 8< -----
From: Paolo Bonzini <pbonzini@redhat.com>
Subject: [PATCH WIP] kvm: lapic: run ack notifiers for userspace IOAPIC routes

diff --git a/arch/x86/kvm/ioapic.h b/arch/x86/kvm/ioapic.h
index ea1a4e0297da..be337e06e3fd 100644
--- a/arch/x86/kvm/ioapic.h
+++ b/arch/x86/kvm/ioapic.h
@@ -135,4 +135,6 @@ void kvm_ioapic_scan_entry(struct kvm_vcpu *vcpu,
 			   ulong *ioapic_handled_vectors);
 void kvm_scan_ioapic_routes(struct kvm_vcpu *vcpu,
 			    ulong *ioapic_handled_vectors);
+void kvm_notify_userspace_ioapic_routes(struct kvm *kvm, int vector);
+
 #endif
diff --git a/arch/x86/kvm/irq_comm.c b/arch/x86/kvm/irq_comm.c
index 3cc3b2d130a0..46ea79a0dd8a 100644
--- a/arch/x86/kvm/irq_comm.c
+++ b/arch/x86/kvm/irq_comm.c
@@ -435,6 +435,34 @@ void kvm_scan_ioapic_routes(struct kvm_vcpu *vcpu,
 	srcu_read_unlock(&kvm->irq_srcu, idx);
 }
 
+void kvm_notify_userspace_ioapic_routes(struct kvm *kvm, int vector)
+{
+        struct kvm_kernel_irq_routing_entry *entry;
+        struct kvm_irq_routing_table *table;
+        u32 i, nr_ioapic_pins;
+        int idx;
+
+        idx = srcu_read_lock(&kvm->irq_srcu);
+        table = srcu_dereference(kvm->irq_routing, &kvm->irq_srcu);
+        nr_ioapic_pins = min_t(u32, table->nr_rt_entries,
+                               kvm->arch.nr_reserved_ioapic_pins);
+
+	for (i = 0; i < nr_ioapic_pins; i++) {
+                hlist_for_each_entry(entry, &table->map[i], link) {
+                        struct kvm_lapic_irq irq;
+
+                        if (entry->type != KVM_IRQ_ROUTING_MSI)
+                                continue;
+
+			kvm_set_msi_irq(kvm, entry, &irq);
+			if (irq.level && irq.vector == vector)
+				kvm_notify_acked_gsi(kvm, i);
+		}
+	}
+
+	srcu_read_unlock(&kvm->irq_srcu, idx);
+}
+
 void kvm_arch_irq_routing_update(struct kvm *kvm)
 {
 	kvm_hv_irq_routing_update(kvm);
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 9f089e2e09d0..8f8e76d77925 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -1142,6 +1142,7 @@ static bool kvm_ioapic_handles_vector(struct kvm_lapic *apic, int vector)
 
 static void kvm_ioapic_send_eoi(struct kvm_lapic *apic, int vector)
 {
+	struct kvm_vcpu *vcpu = apic->vcpu;
 	int trigger_mode;
 
 	/* Eoi the ioapic only if the ioapic doesn't own the vector. */
@@ -1149,9 +1150,10 @@ static void kvm_ioapic_send_eoi(struct kvm_lapic *apic, int vector)
 		return;
 
 	/* Request a KVM exit to inform the userspace IOAPIC. */
-	if (irqchip_split(apic->vcpu->kvm)) {
-		apic->vcpu->arch.pending_ioapic_eoi = vector;
-		kvm_make_request(KVM_REQ_IOAPIC_EOI_EXIT, apic->vcpu);
+	if (irqchip_split(vcpu->kvm)) {
+		kvm_notify_userspace_ioapic_routes(vcpu->kvm, vector);
+		vcpu->arch.pending_ioapic_eoi = vector;
+		kvm_make_request(KVM_REQ_IOAPIC_EOI_EXIT, vcpu);
 		return;
 	}
 
@@ -1160,7 +1162,7 @@ static void kvm_ioapic_send_eoi(struct kvm_lapic *apic, int vector)
 	else
 		trigger_mode = IOAPIC_EDGE_TRIG;
 
-	kvm_ioapic_update_eoi(apic->vcpu, vector, trigger_mode);
+	kvm_ioapic_update_eoi(vcpu, vector, trigger_mode);
 }
 
 static int apic_set_eoi(struct kvm_lapic *apic)


> so it looks like KVM really ought to return an
> error when trying to register a resample IRQFD when irqchip_split(),
> but do we have better options?  EOI handling in QEMU is pretty much
> non-existent and even if it was in place, it's a big performance
> regression for vfio INTx handling.  Is a split irqchip that retains
> performant resampling (EOI) support possible?  Thanks,

Paolo Bonzini April 27, 2019, 8:09 a.m. UTC | #5

On 27/04/19 07:29, Paolo Bonzini wrote:
> 
>>> In my testing it looks like KVM advertises supporting the KVM_IRQFD
>>> resample feature, but vfio never gets the unmask notification, so the
>>> device remains with DisINTx set and no further interrupts are
>>> generated.  Do we expect KVM's IRQFD with resampler to work in the
>>> split IRQ mode?  We can certainly hope that "high performance" devices
>>> use MSI or MSI/X, but this would be quite a performance regression with
>>> split mode if our userspace bypass for INTx goes away.  Thanks,
>>
>> arch/x86/kvm/lapic.c:kvm_ioapic_send_eoi() dumps to userspace before
>> kvm_ioapic_update_eoi() can handle the irq_ack_notifier_list via
>> kvm_notify_acked_gsi(),
> 
> That wouldn't help because kvm_ioapic_update_eoi would not even be
> able to access vcpu->kvm->arch.vioapic (it's NULL).
> 
> The following untested patch would signal the resamplefd in kvm_ioapic_send_eoi,
> before requesting the exit to userspace.  However I am not sure how QEMU
> sets up the VFIO eventfds: if I understand correctly, when VFIO writes again to
> the irq eventfd, the interrupt request would not reach the userspace IOAPIC, but
> only the in-kernel LAPIC.  That would be incorrect, and if my understanding is
> correct we need to trigger resampling from hw/intc/ioapic.c.

Actually it's worse: because you're bypassing IOAPIC when raising the
irq, the IOAPIC's remote_irr for example will not be set.  So split
irqchip currently must disable the intx fast path completely.

I guess we could also reimplement irqfd and resamplefd in the userspace
IOAPIC, and run the listener in a separate thread (using "-object
iothread" on the command line and AioContext in the code).

Paolo

> 
> Something like:
> 
> 1) VFIO uses kvm_irqchip_add_irqfd_notifier_gsi instead of KVM_IRQFD directly
> 
> 2) add a NotifierList in kvm_irqchip_assign_irqfd
> 
> 3) if kvm_irqchip_is_split(), register a corresponding Notifier in ioapic.c which
> stores the resamplefd as an EventNotifier*
> 
> 4) ioapic_eoi_broadcast writes to the resamplefd
> 
> BTW, how do non-x86 platforms handle intx resampling?
> 
> Paolo
> 
> ---- 8< -----
> From: Paolo Bonzini <pbonzini@redhat.com>
> Subject: [PATCH WIP] kvm: lapic: run ack notifiers for userspace IOAPIC routes
> 
> diff --git a/arch/x86/kvm/ioapic.h b/arch/x86/kvm/ioapic.h
> index ea1a4e0297da..be337e06e3fd 100644
> --- a/arch/x86/kvm/ioapic.h
> +++ b/arch/x86/kvm/ioapic.h
> @@ -135,4 +135,6 @@ void kvm_ioapic_scan_entry(struct kvm_vcpu *vcpu,
>  			   ulong *ioapic_handled_vectors);
>  void kvm_scan_ioapic_routes(struct kvm_vcpu *vcpu,
>  			    ulong *ioapic_handled_vectors);
> +void kvm_notify_userspace_ioapic_routes(struct kvm *kvm, int vector);
> +
>  #endif
> diff --git a/arch/x86/kvm/irq_comm.c b/arch/x86/kvm/irq_comm.c
> index 3cc3b2d130a0..46ea79a0dd8a 100644
> --- a/arch/x86/kvm/irq_comm.c
> +++ b/arch/x86/kvm/irq_comm.c
> @@ -435,6 +435,34 @@ void kvm_scan_ioapic_routes(struct kvm_vcpu *vcpu,
>  	srcu_read_unlock(&kvm->irq_srcu, idx);
>  }
>  
> +void kvm_notify_userspace_ioapic_routes(struct kvm *kvm, int vector)
> +{
> +        struct kvm_kernel_irq_routing_entry *entry;
> +        struct kvm_irq_routing_table *table;
> +        u32 i, nr_ioapic_pins;
> +        int idx;
> +
> +        idx = srcu_read_lock(&kvm->irq_srcu);
> +        table = srcu_dereference(kvm->irq_routing, &kvm->irq_srcu);
> +        nr_ioapic_pins = min_t(u32, table->nr_rt_entries,
> +                               kvm->arch.nr_reserved_ioapic_pins);
> +
> +	for (i = 0; i < nr_ioapic_pins; i++) {
> +                hlist_for_each_entry(entry, &table->map[i], link) {
> +                        struct kvm_lapic_irq irq;
> +
> +                        if (entry->type != KVM_IRQ_ROUTING_MSI)
> +                                continue;
> +
> +			kvm_set_msi_irq(kvm, entry, &irq);
> +			if (irq.level && irq.vector == vector)
> +				kvm_notify_acked_gsi(kvm, i);
> +		}
> +	}
> +
> +	srcu_read_unlock(&kvm->irq_srcu, idx);
> +}
> +
>  void kvm_arch_irq_routing_update(struct kvm *kvm)
>  {
>  	kvm_hv_irq_routing_update(kvm);
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index 9f089e2e09d0..8f8e76d77925 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -1142,6 +1142,7 @@ static bool kvm_ioapic_handles_vector(struct kvm_lapic *apic, int vector)
>  
>  static void kvm_ioapic_send_eoi(struct kvm_lapic *apic, int vector)
>  {
> +	struct kvm_vcpu *vcpu = apic->vcpu;
>  	int trigger_mode;
>  
>  	/* Eoi the ioapic only if the ioapic doesn't own the vector. */
> @@ -1149,9 +1150,10 @@ static void kvm_ioapic_send_eoi(struct kvm_lapic *apic, int vector)
>  		return;
>  
>  	/* Request a KVM exit to inform the userspace IOAPIC. */
> -	if (irqchip_split(apic->vcpu->kvm)) {
> -		apic->vcpu->arch.pending_ioapic_eoi = vector;
> -		kvm_make_request(KVM_REQ_IOAPIC_EOI_EXIT, apic->vcpu);
> +	if (irqchip_split(vcpu->kvm)) {
> +		kvm_notify_userspace_ioapic_routes(vcpu->kvm, vector);
> +		vcpu->arch.pending_ioapic_eoi = vector;
> +		kvm_make_request(KVM_REQ_IOAPIC_EOI_EXIT, vcpu);
>  		return;
>  	}
>  
> @@ -1160,7 +1162,7 @@ static void kvm_ioapic_send_eoi(struct kvm_lapic *apic, int vector)
>  	else
>  		trigger_mode = IOAPIC_EDGE_TRIG;
>  
> -	kvm_ioapic_update_eoi(apic->vcpu, vector, trigger_mode);
> +	kvm_ioapic_update_eoi(vcpu, vector, trigger_mode);
>  }
>  
>  static int apic_set_eoi(struct kvm_lapic *apic)
> 
> 
>> so it looks like KVM really ought to return an
>> error when trying to register a resample IRQFD when irqchip_split(),
>> but do we have better options?  EOI handling in QEMU is pretty much
>> non-existent and even if it was in place, it's a big performance
>> regression for vfio INTx handling.  Is a split irqchip that retains
>> performant resampling (EOI) support possible?  Thanks,

Alex Williamson April 29, 2019, 2:21 p.m. UTC | #6

On Sat, 27 Apr 2019 10:09:51 +0200
Paolo Bonzini <pbonzini@redhat.com> wrote:

> On 27/04/19 07:29, Paolo Bonzini wrote:
> >   
> >>> In my testing it looks like KVM advertises supporting the KVM_IRQFD
> >>> resample feature, but vfio never gets the unmask notification, so the
> >>> device remains with DisINTx set and no further interrupts are
> >>> generated.  Do we expect KVM's IRQFD with resampler to work in the
> >>> split IRQ mode?  We can certainly hope that "high performance" devices
> >>> use MSI or MSI/X, but this would be quite a performance regression with
> >>> split mode if our userspace bypass for INTx goes away.  Thanks,  
> >>
> >> arch/x86/kvm/lapic.c:kvm_ioapic_send_eoi() dumps to userspace before
> >> kvm_ioapic_update_eoi() can handle the irq_ack_notifier_list via
> >> kvm_notify_acked_gsi(),  
> > 
> > That wouldn't help because kvm_ioapic_update_eoi would not even be
> > able to access vcpu->kvm->arch.vioapic (it's NULL).
> > 
> > The following untested patch would signal the resamplefd in kvm_ioapic_send_eoi,
> > before requesting the exit to userspace.  However I am not sure how QEMU
> > sets up the VFIO eventfds: if I understand correctly, when VFIO writes again to
> > the irq eventfd, the interrupt request would not reach the userspace IOAPIC, but
> > only the in-kernel LAPIC.  That would be incorrect, and if my understanding is
> > correct we need to trigger resampling from hw/intc/ioapic.c.  
> 
> Actually it's worse: because you're bypassing IOAPIC when raising the
> irq, the IOAPIC's remote_irr for example will not be set.  So split
> irqchip currently must disable the intx fast path completely.
> 
> I guess we could also reimplement irqfd and resamplefd in the userspace
> IOAPIC, and run the listener in a separate thread (using "-object
> iothread" on the command line and AioContext in the code).

This sounds like a performance regression vs KVM irqchip any way we
slice it.  Was this change a mistake?  Without KVM support, the
universal support in QEMU kicks in, where device mmaps are disabled
when an INTx occurs, forcing trapped access to the device, and we
assume that the next access is in response to an interrupt and trigger
our own internal EOI and re-enable mmaps.  A timer acts as a
catch-all.  Needless to say, this is functional but not fast.  It would
be a massive performance regression for devices depending on INTx and
previously using the KVM bypass to switch to this.  INTx is largely
considered a legacy interrupt, so non-x86 archs don't encounter it as
often, S390 even explicitly disables INTx support.  ARM and POWER
likely just don't see a lot of these devices, but nearly all devices
(except SR-IOV VFs) on x86 expect an INTx fallback mode and some
drivers may run the device in INTx for compatibility.  This split
irqchip change was likely fine for "enterprise" users concerned only
with modern high speed devices, but very much not for device assignment
used for compatibility use cases or commodity hardware users.

What's a good 4.0.1 strategy to resolve this?  Re-instate KVM irqchip
as the Q35 default?  I can't see that simply switching to current QEMU
handling is a viable option for performance?  What about 4.1?  We could
certainly improve EOI support in QEMU, there's essentially no support
currently, but it seems like an uphill battle for an iothread based
userspace ioapic to ever compare to KVM handling?  Thanks,

Alex

Eduardo Habkost April 29, 2019, 2:55 p.m. UTC | #7

On Mon, Apr 29, 2019 at 08:21:06AM -0600, Alex Williamson wrote:
> On Sat, 27 Apr 2019 10:09:51 +0200
> Paolo Bonzini <pbonzini@redhat.com> wrote:
> 
> > On 27/04/19 07:29, Paolo Bonzini wrote:
> > >   
> > >>> In my testing it looks like KVM advertises supporting the KVM_IRQFD
> > >>> resample feature, but vfio never gets the unmask notification, so the
> > >>> device remains with DisINTx set and no further interrupts are
> > >>> generated.  Do we expect KVM's IRQFD with resampler to work in the
> > >>> split IRQ mode?  We can certainly hope that "high performance" devices
> > >>> use MSI or MSI/X, but this would be quite a performance regression with
> > >>> split mode if our userspace bypass for INTx goes away.  Thanks,  
> > >>
> > >> arch/x86/kvm/lapic.c:kvm_ioapic_send_eoi() dumps to userspace before
> > >> kvm_ioapic_update_eoi() can handle the irq_ack_notifier_list via
> > >> kvm_notify_acked_gsi(),  
> > > 
> > > That wouldn't help because kvm_ioapic_update_eoi would not even be
> > > able to access vcpu->kvm->arch.vioapic (it's NULL).
> > > 
> > > The following untested patch would signal the resamplefd in kvm_ioapic_send_eoi,
> > > before requesting the exit to userspace.  However I am not sure how QEMU
> > > sets up the VFIO eventfds: if I understand correctly, when VFIO writes again to
> > > the irq eventfd, the interrupt request would not reach the userspace IOAPIC, but
> > > only the in-kernel LAPIC.  That would be incorrect, and if my understanding is
> > > correct we need to trigger resampling from hw/intc/ioapic.c.  
> > 
> > Actually it's worse: because you're bypassing IOAPIC when raising the
> > irq, the IOAPIC's remote_irr for example will not be set.  So split
> > irqchip currently must disable the intx fast path completely.
> > 
> > I guess we could also reimplement irqfd and resamplefd in the userspace
> > IOAPIC, and run the listener in a separate thread (using "-object
> > iothread" on the command line and AioContext in the code).
> 
> This sounds like a performance regression vs KVM irqchip any way we
> slice it.  Was this change a mistake?  Without KVM support, the
> universal support in QEMU kicks in, where device mmaps are disabled
> when an INTx occurs, forcing trapped access to the device, and we
> assume that the next access is in response to an interrupt and trigger
> our own internal EOI and re-enable mmaps.  A timer acts as a
> catch-all.  Needless to say, this is functional but not fast.  It would
> be a massive performance regression for devices depending on INTx and
> previously using the KVM bypass to switch to this.  INTx is largely
> considered a legacy interrupt, so non-x86 archs don't encounter it as
> often, S390 even explicitly disables INTx support.  ARM and POWER
> likely just don't see a lot of these devices, but nearly all devices
> (except SR-IOV VFs) on x86 expect an INTx fallback mode and some
> drivers may run the device in INTx for compatibility.  This split
> irqchip change was likely fine for "enterprise" users concerned only
> with modern high speed devices, but very much not for device assignment
> used for compatibility use cases or commodity hardware users.
> 
> What's a good 4.0.1 strategy to resolve this?  Re-instate KVM irqchip
> as the Q35 default?  I can't see that simply switching to current QEMU
> handling is a viable option for performance?  What about 4.1?  We could
> certainly improve EOI support in QEMU, there's essentially no support
> currently, but it seems like an uphill battle for an iothread based
> userspace ioapic to ever compare to KVM handling?  Thanks,

irqchip=split and irqchip=kernel aren't guest ABI compatible, are
they?  That would make it impossible to fix this in pc-q35-4.0
for a 4.0.1 update.

Alex Williamson April 29, 2019, 3:22 p.m. UTC | #8

On Mon, 29 Apr 2019 11:55:56 -0300
Eduardo Habkost <ehabkost@redhat.com> wrote:

> On Mon, Apr 29, 2019 at 08:21:06AM -0600, Alex Williamson wrote:
> > On Sat, 27 Apr 2019 10:09:51 +0200
> > Paolo Bonzini <pbonzini@redhat.com> wrote:
> >   
> > > On 27/04/19 07:29, Paolo Bonzini wrote:  
> > > >     
> > > >>> In my testing it looks like KVM advertises supporting the KVM_IRQFD
> > > >>> resample feature, but vfio never gets the unmask notification, so the
> > > >>> device remains with DisINTx set and no further interrupts are
> > > >>> generated.  Do we expect KVM's IRQFD with resampler to work in the
> > > >>> split IRQ mode?  We can certainly hope that "high performance" devices
> > > >>> use MSI or MSI/X, but this would be quite a performance regression with
> > > >>> split mode if our userspace bypass for INTx goes away.  Thanks,    
> > > >>
> > > >> arch/x86/kvm/lapic.c:kvm_ioapic_send_eoi() dumps to userspace before
> > > >> kvm_ioapic_update_eoi() can handle the irq_ack_notifier_list via
> > > >> kvm_notify_acked_gsi(),    
> > > > 
> > > > That wouldn't help because kvm_ioapic_update_eoi would not even be
> > > > able to access vcpu->kvm->arch.vioapic (it's NULL).
> > > > 
> > > > The following untested patch would signal the resamplefd in kvm_ioapic_send_eoi,
> > > > before requesting the exit to userspace.  However I am not sure how QEMU
> > > > sets up the VFIO eventfds: if I understand correctly, when VFIO writes again to
> > > > the irq eventfd, the interrupt request would not reach the userspace IOAPIC, but
> > > > only the in-kernel LAPIC.  That would be incorrect, and if my understanding is
> > > > correct we need to trigger resampling from hw/intc/ioapic.c.    
> > > 
> > > Actually it's worse: because you're bypassing IOAPIC when raising the
> > > irq, the IOAPIC's remote_irr for example will not be set.  So split
> > > irqchip currently must disable the intx fast path completely.
> > > 
> > > I guess we could also reimplement irqfd and resamplefd in the userspace
> > > IOAPIC, and run the listener in a separate thread (using "-object
> > > iothread" on the command line and AioContext in the code).  
> > 
> > This sounds like a performance regression vs KVM irqchip any way we
> > slice it.  Was this change a mistake?  Without KVM support, the
> > universal support in QEMU kicks in, where device mmaps are disabled
> > when an INTx occurs, forcing trapped access to the device, and we
> > assume that the next access is in response to an interrupt and trigger
> > our own internal EOI and re-enable mmaps.  A timer acts as a
> > catch-all.  Needless to say, this is functional but not fast.  It would
> > be a massive performance regression for devices depending on INTx and
> > previously using the KVM bypass to switch to this.  INTx is largely
> > considered a legacy interrupt, so non-x86 archs don't encounter it as
> > often, S390 even explicitly disables INTx support.  ARM and POWER
> > likely just don't see a lot of these devices, but nearly all devices
> > (except SR-IOV VFs) on x86 expect an INTx fallback mode and some
> > drivers may run the device in INTx for compatibility.  This split
> > irqchip change was likely fine for "enterprise" users concerned only
> > with modern high speed devices, but very much not for device assignment
> > used for compatibility use cases or commodity hardware users.
> > 
> > What's a good 4.0.1 strategy to resolve this?  Re-instate KVM irqchip
> > as the Q35 default?  I can't see that simply switching to current QEMU
> > handling is a viable option for performance?  What about 4.1?  We could
> > certainly improve EOI support in QEMU, there's essentially no support
> > currently, but it seems like an uphill battle for an iothread based
> > userspace ioapic to ever compare to KVM handling?  Thanks,  
> 
> irqchip=split and irqchip=kernel aren't guest ABI compatible, are
> they?  That would make it impossible to fix this in pc-q35-4.0
> for a 4.0.1 update.

I suppose it would require a pc-q35-4.0.1 machine type :-\  Thanks,

Alex

Alex Williamson April 30, 2019, 11:01 p.m. UTC | #9

On Mon, 29 Apr 2019 08:21:06 -0600
Alex Williamson <alex.williamson@redhat.com> wrote:

> On Sat, 27 Apr 2019 10:09:51 +0200
> Paolo Bonzini <pbonzini@redhat.com> wrote:
> 
> > On 27/04/19 07:29, Paolo Bonzini wrote:  
> > >     
> > >>> In my testing it looks like KVM advertises supporting the KVM_IRQFD
> > >>> resample feature, but vfio never gets the unmask notification, so the
> > >>> device remains with DisINTx set and no further interrupts are
> > >>> generated.  Do we expect KVM's IRQFD with resampler to work in the
> > >>> split IRQ mode?  We can certainly hope that "high performance" devices
> > >>> use MSI or MSI/X, but this would be quite a performance regression with
> > >>> split mode if our userspace bypass for INTx goes away.  Thanks,    
> > >>
> > >> arch/x86/kvm/lapic.c:kvm_ioapic_send_eoi() dumps to userspace before
> > >> kvm_ioapic_update_eoi() can handle the irq_ack_notifier_list via
> > >> kvm_notify_acked_gsi(),    
> > > 
> > > That wouldn't help because kvm_ioapic_update_eoi would not even be
> > > able to access vcpu->kvm->arch.vioapic (it's NULL).
> > > 
> > > The following untested patch would signal the resamplefd in kvm_ioapic_send_eoi,
> > > before requesting the exit to userspace.  However I am not sure how QEMU
> > > sets up the VFIO eventfds: if I understand correctly, when VFIO writes again to
> > > the irq eventfd, the interrupt request would not reach the userspace IOAPIC, but
> > > only the in-kernel LAPIC.  That would be incorrect, and if my understanding is
> > > correct we need to trigger resampling from hw/intc/ioapic.c.    
> > 
> > Actually it's worse: because you're bypassing IOAPIC when raising the
> > irq, the IOAPIC's remote_irr for example will not be set.  So split
> > irqchip currently must disable the intx fast path completely.
> > 
> > I guess we could also reimplement irqfd and resamplefd in the userspace
> > IOAPIC, and run the listener in a separate thread (using "-object
> > iothread" on the command line and AioContext in the code).  
> 
> This sounds like a performance regression vs KVM irqchip any way we
> slice it.  Was this change a mistake?  Without KVM support, the
> universal support in QEMU kicks in, where device mmaps are disabled
> when an INTx occurs, forcing trapped access to the device, and we
> assume that the next access is in response to an interrupt and trigger
> our own internal EOI and re-enable mmaps.  A timer acts as a
> catch-all.  Needless to say, this is functional but not fast.  It would
> be a massive performance regression for devices depending on INTx and
> previously using the KVM bypass to switch to this.  INTx is largely
> considered a legacy interrupt, so non-x86 archs don't encounter it as
> often, S390 even explicitly disables INTx support.  ARM and POWER
> likely just don't see a lot of these devices, but nearly all devices
> (except SR-IOV VFs) on x86 expect an INTx fallback mode and some
> drivers may run the device in INTx for compatibility.  This split
> irqchip change was likely fine for "enterprise" users concerned only
> with modern high speed devices, but very much not for device assignment
> used for compatibility use cases or commodity hardware users.
> 
> What's a good 4.0.1 strategy to resolve this?  Re-instate KVM irqchip
> as the Q35 default?  I can't see that simply switching to current QEMU
> handling is a viable option for performance?  What about 4.1?  We could
> certainly improve EOI support in QEMU, there's essentially no support
> currently, but it seems like an uphill battle for an iothread based
> userspace ioapic to ever compare to KVM handling?  Thanks,

Poking at this a bit, we can add kvm_irqchip_is_split() to the set of
things we test for in hw/vfio/pci.c:vfio_intx_enable_kvm() to avoid the
KVM INTx bypass when using split IRQ chip.  This at least avoids paths
that cannot work currently.  We'll fall back to vfio's universal EOI
detection of toggling direct mapped MemoryRegions, which is enough for
simple devices like NICs.  However, it's barely functional with an
NVIDIA GeForce card assigned to a Windows VM, it only takes a graphics
test program to send it over the edge and trigger a TDR.  Even with TDR
disabled, the VM will hang.  I also played with ioapic_eoi_broadcast()
calling directly into vfio code to trigger the EOI (without the mmap
toggling), it's even worse, Windows can't even get to the desktop
before it hangs.

So while I was initially impressed that netperf TCP_RR results for a
gigabit NIC were not that different between in-kernel ioapic and split
irqchip, the graphics results have me again wondering why we made this
change and how userspace handling can get us back to a functional level.

The only way I can get the GPU/Windows configuration usable is to
assert the IRQ, immediately de-assert, and unmask the device all from
vfio_intx_interrupt().  An interrupt intensive graphics benchmark runs
at ~80% of KVM irqchip with about 10% more CPU load with this
experiment (but it actually runs!).  Potentially some devices could
mimic INTx using MSI, like legacy KVM device assignment used to do with
this mode, eliminating the unmask ioctl, but even the legacy driver
noted compatibility issues with that mode and neither is a good
reproduction of how INTx is supposed to work.

Any other insights appreciated, and I really would like to understand
what we've gained with split irqchip and whether it's worth this.
Thanks,

Alex

Paolo Bonzini April 30, 2019, 11:09 p.m. UTC | #10

On 01/05/19 01:01, Alex Williamson wrote:
> Poking at this a bit, we can add kvm_irqchip_is_split() to the set of
> things we test for in hw/vfio/pci.c:vfio_intx_enable_kvm() to avoid the
> KVM INTx bypass when using split IRQ chip.

Yes, this should be enough.

> The only way I can get the GPU/Windows configuration usable is to
> assert the IRQ, immediately de-assert, and unmask the device all from
> vfio_intx_interrupt().  An interrupt intensive graphics benchmark runs
> at ~80% of KVM irqchip with about 10% more CPU load with this
> experiment (but it actually runs!).

Nice.  If you can do it directly from hw/vfio there may be no need to do
more changes to the IOAPIC, and least not immediately.  But that is not
a good emulation of INTX, isn't it?  IIUC, it relies on the
level-triggered interrupt never being masked in the IOAPIC.

> Any other insights appreciated, and I really would like to understand
> what we've gained with split irqchip and whether it's worth this.

We've gained guest interrupt remapping support, we are not relying on
newer kernel versions, and the attack surface from the guest to the
hypervisor is smaller.

Paolo

Alex Williamson May 1, 2019, 12:28 a.m. UTC | #11

On Wed, 1 May 2019 01:09:48 +0200
Paolo Bonzini <pbonzini@redhat.com> wrote:

> On 01/05/19 01:01, Alex Williamson wrote:
> > Poking at this a bit, we can add kvm_irqchip_is_split() to the set of
> > things we test for in hw/vfio/pci.c:vfio_intx_enable_kvm() to avoid the
> > KVM INTx bypass when using split IRQ chip.  
> 
> Yes, this should be enough.
> 
> > The only way I can get the GPU/Windows configuration usable is to
> > assert the IRQ, immediately de-assert, and unmask the device all from
> > vfio_intx_interrupt().  An interrupt intensive graphics benchmark runs
> > at ~80% of KVM irqchip with about 10% more CPU load with this
> > experiment (but it actually runs!).  
> 
> Nice.  If you can do it directly from hw/vfio there may be no need to do
> more changes to the IOAPIC, and least not immediately.  But that is not
> a good emulation of INTX, isn't it?  IIUC, it relies on the
> level-triggered interrupt never being masked in the IOAPIC.

Yeah, that's the problem, well one of the problems in addition to the
lower performance and increased overhead, is that it's a rather poor
emulation of INTx.  It matches pci_irq_pulse(), which has been
discouraged from use.  Anything but kernel irqchip makes me nervous for
INTx, but emulating it with a different physical IRQ type definitely
more so.

> > Any other insights appreciated, and I really would like to understand
> > what we've gained with split irqchip and whether it's worth this.  
> 
> We've gained guest interrupt remapping support, we are not relying on
> newer kernel versions, and the attack surface from the guest to the
> hypervisor is smaller.

Interrupt remapping is really only necessary with high vCPU counts or
maybe nested device assignment, seems doubtful that has a larger user
base.  I did re-watch the video of Steve Rutherford's talk from a
couple years ago, but he does re-iterate that pulling the ioapic from
KVM does have drawbacks for general purpose VMs that I thought would
have precluded it from becoming the default for the base QEMU machine
type.  Getting a GeForce card to run with MSI requires (repeated)
regedit'ing on Windows or module options on Linux.  The audio device
can require the same, otherwise we're probably mostly looking at old
devices assigned for compatibility using INTx, NICs or USB devices
would of course more likely use MSI/X for anything reasonably modern.
Thanks,

Alex

Eduardo Habkost May 3, 2019, 6:42 p.m. UTC | #12

On Mon, Apr 29, 2019 at 09:22:12AM -0600, Alex Williamson wrote:
[...]
> > > What's a good 4.0.1 strategy to resolve this?  Re-instate KVM irqchip
> > > as the Q35 default?  I can't see that simply switching to current QEMU
> > > handling is a viable option for performance?  What about 4.1?  We could
> > > certainly improve EOI support in QEMU, there's essentially no support
> > > currently, but it seems like an uphill battle for an iothread based
> > > userspace ioapic to ever compare to KVM handling?  Thanks,  
> > 
> > irqchip=split and irqchip=kernel aren't guest ABI compatible, are
> > they?  That would make it impossible to fix this in pc-q35-4.0
> > for a 4.0.1 update.
> 
> I suppose it would require a pc-q35-4.0.1 machine type :-\  Thanks,

I wonder if it's possible to untangle this and make the irqchip
option stop affecting guest ABI on 4.1+ machine-types?  This way
QEMU could choose smarter defaults in the future without the
compatibility code hassle.

Michael S. Tsirkin May 3, 2019, 8 p.m. UTC | #13

On Mon, Apr 29, 2019 at 11:55:56AM -0300, Eduardo Habkost wrote:
> irqchip=split and irqchip=kernel aren't guest ABI compatible, are
> they?

Can you remind me why they aren't?

> -- 
> Eduardo

Eduardo Habkost May 3, 2019, 8:23 p.m. UTC | #14

On Fri, May 03, 2019 at 04:00:33PM -0400, Michael S. Tsirkin wrote:
> On Mon, Apr 29, 2019 at 11:55:56AM -0300, Eduardo Habkost wrote:
> > irqchip=split and irqchip=kernel aren't guest ABI compatible, are
> > they?
> 
> Can you remind me why they aren't?

We have the code introduced by patch 3/3 from this series, but I
don't know if it's the only difference:

hw/i386/x86-iommu.c=static void x86_iommu_realize(DeviceState *dev, Error **errp)
[...]
hw/i386/x86-iommu.c:    bool irq_all_kernel = kvm_irqchip_in_kernel() && !kvm_irqchip_is_split();
[...]
hw/i386/x86-iommu.c-    /* If the user didn't specify IR, choose a default value for it */
hw/i386/x86-iommu.c-    if (x86_iommu->intr_supported == ON_OFF_AUTO_AUTO) {
hw/i386/x86-iommu.c-        x86_iommu->intr_supported = irq_all_kernel ?
hw/i386/x86-iommu.c-            ON_OFF_AUTO_OFF : ON_OFF_AUTO_ON;
hw/i386/x86-iommu.c-    }

Peter Xu May 5, 2019, 9:06 a.m. UTC | #15

On Fri, May 03, 2019 at 03:42:06PM -0300, Eduardo Habkost wrote:
> On Mon, Apr 29, 2019 at 09:22:12AM -0600, Alex Williamson wrote:
> [...]
> > > > What's a good 4.0.1 strategy to resolve this?  Re-instate KVM irqchip
> > > > as the Q35 default?  I can't see that simply switching to current QEMU
> > > > handling is a viable option for performance?  What about 4.1?  We could
> > > > certainly improve EOI support in QEMU, there's essentially no support
> > > > currently, but it seems like an uphill battle for an iothread based
> > > > userspace ioapic to ever compare to KVM handling?  Thanks,  
> > > 
> > > irqchip=split and irqchip=kernel aren't guest ABI compatible, are
> > > they?  That would make it impossible to fix this in pc-q35-4.0
> > > for a 4.0.1 update.
> > 
> > I suppose it would require a pc-q35-4.0.1 machine type :-\  Thanks,
> 
> I wonder if it's possible to untangle this and make the irqchip
> option stop affecting guest ABI on 4.1+ machine-types?  This way
> QEMU could choose smarter defaults in the future without the
> compatibility code hassle.

Hi, Eduardo,

Do you mean to enable IOMMU IR for kernel-irqchip=on?  If so, I would
say it's not trivial...  The major issue is that we probably need to
explicitly kick QEMU for every kernel IOAPIC setups since QEMU is the
only one who knows everything about interrupt remapping, while KVM
don't even have such a mechanism so far.

I'm thinking whether we should try to fix the functional problem first
as proposed by Alex?  The problem is even if we switch the default
mode for Q35 the user could still specify that in the command line and
I feel like we'd better fix the functional issue first before we
consider the possible performance regression?  The worst case I
thought of is with the fix we offer a good QEMU 4.1/4.0.1 for users (I
believe in most cases for individual users since this issue seems to
not affect most enterprise users and with modern hardwares) then we
also tell people to explicitly enable kernel-irqchip=on to avoid the
potential performance degradation.

(Sorry for the late chim-in, and of course sorry for breaking VFIO
 from the very beginning...)

Paolo Bonzini May 6, 2019, 4:13 p.m. UTC | #16

On 05/05/19 04:06, Peter Xu wrote:
>> I wonder if it's possible to untangle this and make the irqchip
>> option stop affecting guest ABI on 4.1+ machine-types?  This way
>> QEMU could choose smarter defaults in the future without the
>> compatibility code hassle.
> Hi, Eduardo,
> 
> Do you mean to enable IOMMU IR for kernel-irqchip=on?  If so, I would
> say it's not trivial...  The major issue is that we probably need to
> explicitly kick QEMU for every kernel IOAPIC setups since QEMU is the
> only one who knows everything about interrupt remapping, while KVM
> don't even have such a mechanism so far.

Right, it's not easy and it would be anyway possible only on kernels.
There would have to be a mechanism to setup IOAPIC->MSI routes, similar
to irqchip=split, and as Peter mentions an MMIO exit on writes to the
routing table.

Paolo

Paolo Bonzini May 6, 2019, 4:17 p.m. UTC | #17

On 03/05/19 15:00, Michael S. Tsirkin wrote:
> On Mon, Apr 29, 2019 at 11:55:56AM -0300, Eduardo Habkost wrote:
>> irqchip=split and irqchip=kernel aren't guest ABI compatible, are
>> they?
> 
> Can you remind me why they aren't?

They are compatible if the userspace IOAPIC is configured with "-global
ioapic.version=0x11".  The userspace IOAPIC in addition supports version
0x20, which adds the EOI register (a requirement for interrupt remapping
but not necessary outside that case), and makes it the default.

Paolo

Eduardo Habkost May 6, 2019, 9:16 p.m. UTC | #18

On Mon, May 06, 2019 at 11:13:28AM -0500, Paolo Bonzini wrote:
> On 05/05/19 04:06, Peter Xu wrote:
> >> I wonder if it's possible to untangle this and make the irqchip
> >> option stop affecting guest ABI on 4.1+ machine-types?  This way
> >> QEMU could choose smarter defaults in the future without the
> >> compatibility code hassle.
> > Hi, Eduardo,
> > 
> > Do you mean to enable IOMMU IR for kernel-irqchip=on?  If so, I would
> > say it's not trivial...  The major issue is that we probably need to
> > explicitly kick QEMU for every kernel IOAPIC setups since QEMU is the
> > only one who knows everything about interrupt remapping, while KVM
> > don't even have such a mechanism so far.
> 
> Right, it's not easy and it would be anyway possible only on kernels.
> There would have to be a mechanism to setup IOAPIC->MSI routes, similar
> to irqchip=split, and as Peter mentions an MMIO exit on writes to the
> routing table.

I don't mean we necessarily should enable IR for
kernel-irqchip=on too.  I'd just prefer the default setting to
not depend on the kernel-irqchip option.

x86-iommu could either have intremap=on as the default (and
refuse to run with kernel-irqchip=on without explicit
intremap=off), or simply default to intremap=off.

But as Paolo indicated elsewhere, this is not the only guest ABI
difference between "on" and "split".  Probably it's not worth the
hassle to try to to untangle this.

[v2,1/3] q35: set split kernel irqchip as default

Commit Message

Comments

Patch