Patchwork [RFC,4/5] APIC/IOAPIC EOI callback

Submitter Alex Williamson
Date July 11, 2010, 6:09 p.m.
Message ID <20100711180936.20121.35376.stgit@localhost6.localdomain6>
Permalink /patch/58534/
State New

Comments

Alex Williamson - July 11, 2010, 6:09 p.m.
For device assignment, we need to know when the VM writes an end
of interrupt to the APIC, which allows us to de-assert the interrupt
line and clear the DisINTx bit.  Add a new wrapper for ioapic
generated interrupts with a callback on eoi and create an interface
for drivers to be notified on eoi.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---

 hw/apic.c   |   18 ++++++++++++++++--
 hw/apic.h   |    4 ++++
 hw/ioapic.c |   29 +++++++++++++++++++++++++++--
 hw/pc.h     |   12 +++++++++++-
 4 files changed, 58 insertions(+), 5 deletions(-)
Avi Kivity - July 11, 2010, 6:14 p.m.
On 07/11/2010 09:09 PM, Alex Williamson wrote:
> For device assignment, we need to know when the VM writes an end
> of interrupt to the APIC, which allows us to de-assert the interrupt
> line and clear the DisINTx bit.  Add a new wrapper for ioapic
> generated interrupts with a callback on eoi and create an interface
> for drivers to be notified on eoi.
>    

You aren't going to get this with kvm's in-kernel irqchip, so we need a 
new interface there.
Alex Williamson - July 11, 2010, 6:26 p.m.
On Sun, 2010-07-11 at 21:14 +0300, Avi Kivity wrote:
> On 07/11/2010 09:09 PM, Alex Williamson wrote:
> > For device assignment, we need to know when the VM writes an end
> > of interrupt to the APIC, which allows us to de-assert the interrupt
> > line and clear the DisINTx bit.  Add a new wrapper for ioapic
> > generated interrupts with a callback on eoi and create an interface
> > for drivers to be notified on eoi.
> >    
> 
> You aren't going to get this with kvm's in-kernel irqchip, so we need a 
> new interface there.

Registering an eventfd for the eoi seems like a reasonable alternative.
I also need to figure out how to avoid bouncing the vfio interrupt
events through qemu, but it's a functional start.  Thanks,

Alex
Avi Kivity - July 11, 2010, 6:30 p.m.
On 07/11/2010 09:26 PM, Alex Williamson wrote:
> On Sun, 2010-07-11 at 21:14 +0300, Avi Kivity wrote:
>    
>> On 07/11/2010 09:09 PM, Alex Williamson wrote:
>>      
>>> For device assignment, we need to know when the VM writes an end
>>> of interrupt to the APIC, which allows us to de-assert the interrupt
>>> line and clear the DisINTx bit.  Add a new wrapper for ioapic
>>> generated interrupts with a callback on eoi and create an interface
>>> for drivers to be notified on eoi.
>>>
>>>        
>> You aren't going to get this with kvm's in-kernel irqchip, so we need a
>> new interface there.
>>      
> Registering an eventfd for the eoi seems like a reasonable alternative.
>    

I'm worried about that racing (with what?)

> I also need to figure out how to avoid bouncing the vfio interrupt
> events through qemu, but it's a functional start.  Thanks,
>    

I thought the scheduler has/wants to have something that moves the irq 
to whatever thread it wakes up.  With irqfd, it would flow naturally.
Michael S. Tsirkin - July 11, 2010, 6:54 p.m.
On Sun, Jul 11, 2010 at 09:30:59PM +0300, Avi Kivity wrote:
> On 07/11/2010 09:26 PM, Alex Williamson wrote:
> >On Sun, 2010-07-11 at 21:14 +0300, Avi Kivity wrote:
> >>On 07/11/2010 09:09 PM, Alex Williamson wrote:
> >>>For device assignment, we need to know when the VM writes an end
> >>>of interrupt to the APIC, which allows us to de-assert the interrupt
> >>>line and clear the DisINTx bit.  Add a new wrapper for ioapic
> >>>generated interrupts with a callback on eoi and create an interface
> >>>for drivers to be notified on eoi.
> >>>
> >>You aren't going to get this with kvm's in-kernel irqchip, so we need a
> >>new interface there.
> >Registering an eventfd for the eoi seems like a reasonable alternative.
> 
> I'm worried about that racing (with what?)

With device asserting the interrupt?
Need to make sure that all possible scenarios work well:

	device asserts interrupt
	driver clears interrupt
	device asserts interrupt
	eoi

	device asserts interrupt
	driver clears interrupt
	eoi
	device asserts interrupt

etc

Not that I see issues, these are things we need to check.

> >I also need to figure out how to avoid bouncing the vfio interrupt
> >events through qemu, but it's a functional start.  Thanks,
> 
> I thought the scheduler has/wants to have something that moves the
> irq to whatever thread it wakes up.  With irqfd, it would flow
> naturally.
> 
> -- 
> I have a truly marvellous patch that fixes the bug which this
> signature is too narrow to contain.
Alex Williamson - July 11, 2010, 7:21 p.m.
On Sun, 2010-07-11 at 21:54 +0300, Michael S. Tsirkin wrote:
> On Sun, Jul 11, 2010 at 09:30:59PM +0300, Avi Kivity wrote:
> > On 07/11/2010 09:26 PM, Alex Williamson wrote:
> > >On Sun, 2010-07-11 at 21:14 +0300, Avi Kivity wrote:
> > >>On 07/11/2010 09:09 PM, Alex Williamson wrote:
> > >>>For device assignment, we need to know when the VM writes an end
> > >>>of interrupt to the APIC, which allows us to de-assert the interrupt
> > >>>line and clear the DisINTx bit.  Add a new wrapper for ioapic
> > >>>generated interrupts with a callback on eoi and create an interface
> > >>>for drivers to be notified on eoi.
> > >>>
> > >>You aren't going to get this with kvm's in-kernel irqchip, so we need a
> > >>new interface there.
> > >Registering an eventfd for the eoi seems like a reasonable alternative.
> > 
> > I'm worried about that racing (with what?)
> 
> With device asserting the interrupt?
> Need to make sure that all possible scenarios work well:
> 
> 	device asserts interrupt
> 	driver clears interrupt
> 	device asserts interrupt
> 	eoi
> 
> 	device asserts interrupt
> 	driver clears interrupt
> 	eoi
> 	device asserts interrupt
> 
> etc
> 
> Not that I see issues, these are things we need to check.

I think those are all protected by host and qemu vfio drivers managing
DisINTx.  The way I understand it to work now is:

	device asserts interrupt
	interrupt lands in host vfio driver
	host vfio sets DisINTx on the device
	host vfio sends eventfd
	eventfd lands in qemu vfio, does a qemu_set_irq
        ... guest processes
	guest writes eoi to apic, lands back in qemu vfio driver
	qemu vfio deasserts qemu interrupt
	qemu vfio clears DisINTx

So I don't think there's a race as long as ordering is sane for toggling
DisINTx.  Thanks,

Alex
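
The sequence above can be modelled in a few lines of C. This is only a
sketch under stated assumptions: struct intx_model, host_irq() and
guest_eoi() are hypothetical names, not code from vfio or qemu, and the
eventfd hop and the qemu_set_irq() call are collapsed into plain field
updates.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the INTx state of one assigned device. */
struct intx_model {
    bool dis_intx;      /* DisINTx as set on the physical device */
    bool guest_line;    /* level of the interrupt line the guest sees */
    bool pending;       /* forwarded to the guest, no EOI seen yet */
    unsigned forwarded; /* interrupts forwarded to the guest so far */
};

/* The device tries to interrupt. Returns true if it reached the guest. */
static bool host_irq(struct intx_model *m)
{
    if (m->dis_intx) {
        return false;       /* masked at the device, nothing is delivered */
    }
    m->dis_intx = true;     /* host vfio sets DisINTx */
    m->guest_line = true;   /* eventfd, then qemu vfio raises the line */
    m->pending = true;
    m->forwarded++;
    return true;
}

/* The guest writes EOI to the APIC and the callback reaches qemu vfio. */
static void guest_eoi(struct intx_model *m)
{
    if (!m->pending) {
        return;
    }
    assert(m->dis_intx);    /* pending implies the device is still masked */
    m->pending = false;
    m->guest_line = false;  /* deassert the qemu interrupt first ... */
    m->dis_intx = false;    /* ... then clear DisINTx */
}

/* Walks the sequence once and checks each step. */
static void intx_model_demo(void)
{
    struct intx_model m = { 0 };
    bool fwd;

    fwd = host_irq(&m);
    assert(fwd && m.dis_intx && m.guest_line);

    fwd = host_irq(&m);     /* masked: no second interrupt before EOI */
    assert(!fwd);

    guest_eoi(&m);
    assert(!m.guest_line && !m.dis_intx);

    fwd = host_irq(&m);     /* the device may interrupt again */
    assert(fwd && m.forwarded == 2);
}
```

Calling intx_model_demo() from a main() exercises the ordering: between
the host setting DisINTx and the EOI clearing it, the device cannot
raise a second interrupt.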
Michael S. Tsirkin - July 11, 2010, 7:23 p.m.
On Sun, Jul 11, 2010 at 01:21:18PM -0600, Alex Williamson wrote:
> On Sun, 2010-07-11 at 21:54 +0300, Michael S. Tsirkin wrote:
> > On Sun, Jul 11, 2010 at 09:30:59PM +0300, Avi Kivity wrote:
> > > On 07/11/2010 09:26 PM, Alex Williamson wrote:
> > > >On Sun, 2010-07-11 at 21:14 +0300, Avi Kivity wrote:
> > > >>On 07/11/2010 09:09 PM, Alex Williamson wrote:
> > > >>>For device assignment, we need to know when the VM writes an end
> > > >>>of interrupt to the APIC, which allows us to de-assert the interrupt
> > > >>>line and clear the DisINTx bit.  Add a new wrapper for ioapic
> > > >>>generated interrupts with a callback on eoi and create an interface
> > > >>>for drivers to be notified on eoi.
> > > >>>
> > > >>You aren't going to get this with kvm's in-kernel irqchip, so we need a
> > > >>new interface there.
> > > >Registering an eventfd for the eoi seems like a reasonable alternative.
> > > 
> > > I'm worried about that racing (with what?)
> > 
> > With device asserting the interrupt?
> > Need to make sure that all possible scenarios work well:
> > 
> > 	device asserts interrupt
> > 	driver clears interrupt
> > 	device asserts interrupt
> > 	eoi
> > 
> > 	device asserts interrupt
> > 	driver clears interrupt
> > 	eoi
> > 	device asserts interrupt
> > 
> > etc
> > 
> > Not that I see issues, these are things we need to check.
> 
> I think those are all protected by host and qemu vfio drivers managing
> DisINTx.  The way I understand it to work now is:
> 
> 	device asserts interrupt
> 	interrupt lands in host vfio driver
> 	host vfio sets DisINTx on the device
> 	host vfio sends eventfd
> 	eventfd lands in qemu vfio, does a qemu_set_irq
>         ... guest processes
> 	guest writes eoi to apic, lands back in qemu vfio driver
> 	qemu vfio deasserts qemu interrupt
> 	qemu vfio clears DisINTx
> 
> So I don't think there's a race as long as ordering is sane for toggling
> DisINTx.  Thanks,
> 
> Alex
> 

What about threaded interrupts? I think (correct me if I am wrong)
that they work like this:

 	device asserts interrupt
	guest disables interrupt
 	eoi
	guest enables interrupt
 	driver clears interrupt
 	device asserts interrupt

If so, your code will clear DisINTx immediately which
will always get us another host interrupt:
performance will be hurt. I am also not sure
we'll not lose interrupts.

It seems we need to track interrupt disable/enable as well, and only
clear DisINTx after eoi with interrupts enabled.  Not sure what is the
interface for this.
Alex Williamson - July 11, 2010, 8:03 p.m.
On Sun, 2010-07-11 at 22:23 +0300, Michael S. Tsirkin wrote:
> On Sun, Jul 11, 2010 at 01:21:18PM -0600, Alex Williamson wrote:
> > On Sun, 2010-07-11 at 21:54 +0300, Michael S. Tsirkin wrote:
> > > On Sun, Jul 11, 2010 at 09:30:59PM +0300, Avi Kivity wrote:
> > > > On 07/11/2010 09:26 PM, Alex Williamson wrote:
> > > > >On Sun, 2010-07-11 at 21:14 +0300, Avi Kivity wrote:
> > > > >>On 07/11/2010 09:09 PM, Alex Williamson wrote:
> > > > >>>For device assignment, we need to know when the VM writes an end
> > > > >>>of interrupt to the APIC, which allows us to de-assert the interrupt
> > > > >>>line and clear the DisINTx bit.  Add a new wrapper for ioapic
> > > > >>>generated interrupts with a callback on eoi and create an interface
> > > > >>>for drivers to be notified on eoi.
> > > > >>>
> > > > >>You aren't going to get this with kvm's in-kernel irqchip, so we need a
> > > > >>new interface there.
> > > > >Registering an eventfd for the eoi seems like a reasonable alternative.
> > > > 
> > > > I'm worried about that racing (with what?)
> > > 
> > > With device asserting the interrupt?
> > > Need to make sure that all possible scenarios work well:
> > > 
> > > 	device asserts interrupt
> > > 	driver clears interrupt
> > > 	device asserts interrupt
> > > 	eoi
> > > 
> > > 	device asserts interrupt
> > > 	driver clears interrupt
> > > 	eoi
> > > 	device asserts interrupt
> > > 
> > > etc
> > > 
> > > Not that I see issues, these are things we need to check.
> > 
> > I think those are all protected by host and qemu vfio drivers managing
> > DisINTx.  The way I understand it to work now is:
> > 
> > 	device asserts interrupt
> > 	interrupt lands in host vfio driver
> > 	host vfio sets DisINTx on the device
> > 	host vfio sends eventfd
> > 	eventfd lands in qemu vfio, does a qemu_set_irq
> >         ... guest processes
> > 	guest writes eoi to apic, lands back in qemu vfio driver
> > 	qemu vfio deasserts qemu interrupt
> > 	qemu vfio clears DisINTx
> > 
> > So I don't think there's a race as long as ordering is sane for toggling
> > DisINTx.  Thanks,
> > 
> > Alex
> > 
> 
> What about threaded interrupts? I think (correct me if I am wrong)
> that they work like this:
> 
>  	device asserts interrupt
> 	guest disables interrupt

Is this the guest manipulating DisINTx itself?  I suppose it could be a
device dependent disable as well.

>  	eoi
> 	guest enables interrupt
>  	driver clears interrupt

These two are hopefully reversed or else the driver is expecting to
clear and potentially reassert interrupts anyway.

>  	device asserts interrupt
> 
> If so, your code will clear DisINTx immediately which
> will always get us another host interrupt:
> performance will be hurt. I am also not sure
> we'll not lose interrupts.

Level interrupts are lossy afaik, if it gets cleared but an interrupt
condition still exists, it should be reasserted.

> It seems we need to track interrupt disable/enable as well, and only
> clear DisINTx after eoi with interrupts enabled.  Not sure what is the
> interface for this.

If a driver uses device dependent code to disable interrupts, there's no
issue, we'll clear DisINTx, but the device still won't generate an
interrupt until the dependent code is re-enabled by the guest (assuming
there's no cross talk between DisINTx and device dependent components).

For the case that a guest driver disables via DisINTx, it seems easy to
trap and track that.  So we get:

        device asserts interrupt
        guest disables interrupt
        (trapped, qemu-vfio sets intx.guest_disabled = 1)
        eoi
        (qemu-vfio deasserts qemu interrupts, but because of above doesn't clear DisINTx)
        guest enables interrupt
        (allowed to pass through, intx.guest_disabled = 0)
        driver clears interrupt
        device asserts interrupt

I've already got an intx.pending bit, so I think this just changes the eoi to:

    vdev->intx.pending = 0;
    qemu_set_irq(vdev->pdev.irq[vdev->intx.pin], 0);
    if (!vdev->intx.guest_disabled) {
        vfio_unmask_intx(vdev);
    }

Writing the command register DisINTx bit then just gets some kind of:

    if (cmd & PCI_COMMAND_INTX_DISABLE && intx.pending) {
        intx.guest_disabled = 1;
        cmd &= ~PCI_COMMAND_INTX_DISABLE;
    } else if (!(cmd & PCI_COMMAND_INTX_DISABLE) && intx.guest_disabled) {
        intx.guest_disabled = 0;
    }
    ... allow write

That work?  Thanks,

Alex
Michael S. Tsirkin - July 11, 2010, 8:05 p.m.
On Sun, Jul 11, 2010 at 02:03:34PM -0600, Alex Williamson wrote:
> On Sun, 2010-07-11 at 22:23 +0300, Michael S. Tsirkin wrote:
> > On Sun, Jul 11, 2010 at 01:21:18PM -0600, Alex Williamson wrote:
> > > On Sun, 2010-07-11 at 21:54 +0300, Michael S. Tsirkin wrote:
> > > > On Sun, Jul 11, 2010 at 09:30:59PM +0300, Avi Kivity wrote:
> > > > > On 07/11/2010 09:26 PM, Alex Williamson wrote:
> > > > > >On Sun, 2010-07-11 at 21:14 +0300, Avi Kivity wrote:
> > > > > >>On 07/11/2010 09:09 PM, Alex Williamson wrote:
> > > > > >>>For device assignment, we need to know when the VM writes an end
> > > > > >>>of interrupt to the APIC, which allows us to de-assert the interrupt
> > > > > >>>line and clear the DisINTx bit.  Add a new wrapper for ioapic
> > > > > >>>generated interrupts with a callback on eoi and create an interface
> > > > > >>>for drivers to be notified on eoi.
> > > > > >>>
> > > > > >>You aren't going to get this with kvm's in-kernel irqchip, so we need a
> > > > > >>new interface there.
> > > > > >Registering an eventfd for the eoi seems like a reasonable alternative.
> > > > > 
> > > > > I'm worried about that racing (with what?)
> > > > 
> > > > With device asserting the interrupt?
> > > > Need to make sure that all possible scenarios work well:
> > > > 
> > > > 	device asserts interrupt
> > > > 	driver clears interrupt
> > > > 	device asserts interrupt
> > > > 	eoi
> > > > 
> > > > 	device asserts interrupt
> > > > 	driver clears interrupt
> > > > 	eoi
> > > > 	device asserts interrupt
> > > > 
> > > > etc
> > > > 
> > > > Not that I see issues, these are things we need to check.
> > > 
> > > I think those are all protected by host and qemu vfio drivers managing
> > > DisINTx.  The way I understand it to work now is:
> > > 
> > > 	device asserts interrupt
> > > 	interrupt lands in host vfio driver
> > > 	host vfio sets DisINTx on the device
> > > 	host vfio sends eventfd
> > > 	eventfd lands in qemu vfio, does a qemu_set_irq
> > >         ... guest processes
> > > 	guest writes eoi to apic, lands back in qemu vfio driver
> > > 	qemu vfio deasserts qemu interrupt
> > > 	qemu vfio clears DisINTx
> > > 
> > > So I don't think there's a race as long as ordering is sane for toggling
> > > DisINTx.  Thanks,
> > > 
> > > Alex
> > > 
> > 
> > What about threaded interrupts? I think (correct me if I am wrong)
> > that they work like this:
> > 
> >  	device asserts interrupt
> > 	guest disables interrupt
> 
> Is this the guest manipulating DisINTx itself?  I suppose it could be a
> device dependent disable as well.

It can manipulate it, so we need to virtualize it, but that's a
separate issue.

> >  	eoi
> > 	guest enables interrupt
> >  	driver clears interrupt
> 
> These two are hopefully reversed or else the driver is expecting to
> clear and potentially reassert interrupts anyway.

Yes. Sorry.

> >  	device asserts interrupt
> > 
> > If so, your code will clear DisINTx immediately which
> > will always get us another host interrupt:
> > performance will be hurt. I am also not sure
> > we'll not lose interrupts.
> 
> Level interrupts are lossy afaik, if it gets cleared but an interrupt
> condition still exists, it should be reasserted.

Yes, but I mean we won't interrupt the guest. So it will stay disabled
forever.

> > It seems we need to track interrupt disable/enable as well, and only
> > clear DisINTx after eoi with interrupts enabled.  Not sure what is the
> > interface for this.
> 
> If a driver uses device dependent code to disable interrupts, there's no
> issue, we'll clear DisINTx, but the device still won't generate an
> interrupt until the dependent code is re-enabled by the guest (assuming
> there's no cross talk between DisINTx and device dependent components).
> 
> For the case that a guest driver disables via DisINTx, it seems easy to
> trap and track that.  So we get:
> 
>         device asserts interrupt
>         guest disables interrupt
>         (trapped, qemu-vfio sets intx.guest_disabled = 1)
>         eoi
>         (qemu-vfio deasserts qemu interrupts, but because of above doesn't clear DisINTx)
>         guest enables interrupt
>         (allowed to pass through, intx.guest_disabled = 0)
>         driver clears interrupt
>         device asserts interrupt
> 
> I've already got an intx.pending bit, so I think this just changes the eoi to:
> 
>     vdev->intx.pending = 0;
>     qemu_set_irq(vdev->pdev.irq[vdev->intx.pin], 0);
>     if (!vdev->intx.guest_disabled) {
>         vfio_unmask_intx(vdev);
>     }
> 
> Writing the command register DisINTx bit then just gets some kind of:
> 
>     if (cmd & PCI_COMMAND_INTX_DISABLE && intx.pending) {
>         intx.guest_disabled = 1;
>         cmd &= ~PCI_COMMAND_INTX_DISABLE;
>     } else if (!(cmd & PCI_COMMAND_INTX_DISABLE) && intx.guest_disabled) {
>         intx.guest_disabled = 0;
>     }
>     ... allow write
> 
> That work?  Thanks,
> 
> Alex

No, I mean guest OS disables the specific interrupt with
disable_irq.
Michael S. Tsirkin - July 11, 2010, 8:12 p.m.
On Sun, Jul 11, 2010 at 02:03:34PM -0600, Alex Williamson wrote:
> > What about threaded interrupts?

Just to make the point, imagine a nested virt situation
which uses current kvm device assignment in guest.
Look at the interrupt handler we have there.
Alex Williamson - July 11, 2010, 9:59 p.m.
On Sun, 2010-07-11 at 23:12 +0300, Michael S. Tsirkin wrote:
> On Sun, Jul 11, 2010 at 02:03:34PM -0600, Alex Williamson wrote:
> > > What about threaded interrupts?
> 
> Just to make the point, imagine a nested virt situation
> which uses current kvm device assignment in guest.
> Look at the interrupt handler we have there.

Is the problem you're worried about this:

	guest masks qemu ioapic rte
	device interrupt
	host vfio DisINTx+
	qemu vfio calls qemu_set_irq
	...

In that case, the qemu ioapic irr bit is only toggled by qemu_set_irq
for level triggered interrupts, so the interrupt will be asserted in the
guest when it gets unmasked and we'll get the eoi.
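
A small model of a level-triggered pin may make that argument easier to
follow. It is a sketch, not the qemu ioapic code: struct pin_model and
the pin_*() functions are hypothetical, and the remote_irr flag is a
modelling assumption used only to avoid counting one assertion twice.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one level-triggered IOAPIC pin. */
struct pin_model {
    bool irr;           /* follows the level of the input line */
    bool masked;        /* mask bit of the redirection table entry */
    bool remote_irr;    /* delivered, EOI not seen yet (assumption) */
    unsigned delivered; /* interrupts delivered to the guest */
};

/* Deliver if the line is high, the entry is unmasked and nothing is
 * already in flight. */
static void pin_service(struct pin_model *p)
{
    if (p->irr && !p->masked && !p->remote_irr) {
        p->remote_irr = true;
        p->delivered++;
    }
}

/* Stand-in for qemu_set_irq() on a level-triggered pin. */
static void pin_set_irq(struct pin_model *p, bool level)
{
    p->irr = level;
    pin_service(p);
}

/* The guest masks or unmasks the redirection table entry. */
static void pin_set_mask(struct pin_model *p, bool masked)
{
    p->masked = masked;
    pin_service(p);
}

/* The guest signals EOI; a line that is still high is delivered again. */
static void pin_eoi(struct pin_model *p)
{
    p->remote_irr = false;
    pin_service(p);
}

static void pin_model_demo(void)
{
    struct pin_model p = { 0 };

    pin_set_mask(&p, true);     /* guest masks the rte */
    pin_set_irq(&p, true);      /* qemu vfio raises the line */
    assert(p.delivered == 0);   /* held back while masked, not lost */

    pin_set_mask(&p, false);    /* guest unmasks */
    assert(p.delivered == 1);   /* the latched level is delivered now */

    pin_set_irq(&p, false);     /* eoi callback lowers the line */
    pin_eoi(&p);
    assert(p.delivered == 1);   /* no spurious redelivery */
}
```

The point of the model is that a masked entry delays delivery without
dropping the level, so the EOI (and with it the DisINTx clear) still
arrives once the guest unmasks.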

I can't figure out where your other scenario can leave the DisINTx+:

	device asserts interrupt
	a) DisINTx+ via host vfio
	guest disables interrupt
	b) DisINTx+ via guest, already set
	eoi
	c) DisINTx- via qemu vfio
	driver clears interrupt        
	guest enables interrupt
	d) DisINTx- via guest
	device asserts interrupt

So between c) & d) we're potentially getting more interrupts than we
want, but I can't see anywhere that we drop DisINTx.  If you have a
scenario, let me know.  Thanks,

Alex
Avi Kivity - July 12, 2010, 6:33 a.m.
On 07/11/2010 09:30 PM, Avi Kivity wrote:
>> Registering an eventfd for the eoi seems like a reasonable alternative.
>
> I'm worried about that racing (with what?)

I don't think there's a problem.

First, the EOI message is itself asynchronous.  While the write to the 
local APIC is synchronous, effects on the rest of the system are 
effected using an APIC message, which travels asynchronously.

Second, a component that needs timely information doesn't have to wait; 
it can read the eventfd and be sure it has seen all EOIs up to now.
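
To illustrate the second point: a Linux eventfd keeps a 64-bit counter,
so one read() returns every signal written since the previous read. The
sketch below assumes each EOI is reported by adding 1 to the counter;
eoi_signal() and eoi_drain() are hypothetical helper names, and only
eventfd(2), read(2) and write(2) are real interfaces.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Report one EOI by adding 1 to the counter. Returns 0 on success. */
static int eoi_signal(int fd)
{
    uint64_t one = 1;

    return write(fd, &one, sizeof(one)) == (ssize_t)sizeof(one) ? 0 : -1;
}

/* Return the number of EOIs reported since the last drain, 0 if there
 * were none, or -1 on error. The fd must be non-blocking. */
static int64_t eoi_drain(int fd)
{
    uint64_t count;
    ssize_t n = read(fd, &count, sizeof(count));

    if (n == (ssize_t)sizeof(count)) {
        return (int64_t)count;  /* the read also resets the counter */
    }
    if (n < 0 && errno == EAGAIN) {
        return 0;               /* counter was zero */
    }
    return -1;
}

static void eoi_eventfd_demo(void)
{
    int fd = eventfd(0, EFD_NONBLOCK);
    int rc;

    assert(fd >= 0);
    assert(eoi_drain(fd) == 0);     /* nothing reported yet */

    rc = eoi_signal(fd);
    assert(rc == 0);
    rc = eoi_signal(fd);
    assert(rc == 0);

    assert(eoi_drain(fd) == 2);     /* both EOIs seen in a single read */
    assert(eoi_drain(fd) == 0);     /* and the counter is back to zero */
    close(fd);
}
```

A consumer that needs to know whether an EOI has happened before it
makes a decision can call eoi_drain() at that moment instead of waiting
for a notification.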
Gleb Natapov - July 12, 2010, 9:05 a.m.
On Mon, Jul 12, 2010 at 09:33:12AM +0300, Avi Kivity wrote:
> On 07/11/2010 09:30 PM, Avi Kivity wrote:
> >>Registering an eventfd for the eoi seems like a reasonable alternative.
> >
> >I'm worried about that racing (with what?)
> 
> I don't think there's a problem.
> 
> First, the EOI message is itself asynchronous.  While the write to
> the local APIC is synchronous, effects on the rest of the system are
> effected using an APIC message, which travels asynchronously.
> 
> Second, a component that needs timely information doesn't have to
> wait; it can read the eventfd and be sure it has seen all EOIs up to
> now.
> 
I remember we already discussed the use of eventfd for reporting EOI and 
decided against it, but I don't remember why. :( Was it because if we
are going to export EOI to userspace anyway we want to be able to use it
for RTC timedrift fixing and for that we need to know what CPU called
EOI and eventfd can't provide that?

--
			Gleb.
Avi Kivity - July 12, 2010, 9:13 a.m.
On 07/12/2010 12:05 PM, Gleb Natapov wrote:
> On Mon, Jul 12, 2010 at 09:33:12AM +0300, Avi Kivity wrote:
>    
>> On 07/11/2010 09:30 PM, Avi Kivity wrote:
>>      
>>>> Registering an eventfd for the eoi seems like a reasonable alternative.
>>>>          
>>> I'm worried about that racing (with what?)
>>>        
>> I don't think there's a problem.
>>
>> First, the EOI message is itself asynchronous.  While the write to
>> the local APIC is synchronous, effects on the rest of the system are
>> effected using an APIC message, which travels asynchronously.
>>
>> Second, a component that needs timely information doesn't have to
>> wait; it can read the eventfd and be sure it has seen all EOIs up to
>> now.
>>
>>      
> I remember we already discussed the use of eventfd for reporting EOI and
> decided against it, but I don't remember why. :( Was it because if we
> are going to export EOI to userspace anyway we want to be able to use it
> for RTC timedrift fixing and for that we need to know what CPU called
> EOI and eventfd can't provide that?
>    

IIRC it was the synchronicity argument.  But it's bogus: if the RTC wants
to know whether an ack occurred before it makes some decision, all it has
to do is read() the eventfd and find out.

Another issue is which cpu issued the ack.  I suppose we can have 
per-vcpu eventfds, though that's ugly.

Patch

diff --git a/hw/apic.c b/hw/apic.c
index d686b51..8f512df 100644
--- a/hw/apic.c
+++ b/hw/apic.c
@@ -21,6 +21,7 @@ 
 #include "qemu-timer.h"
 #include "host-utils.h"
 #include "sysbus.h"
+#include "pc.h"
 
 //#define DEBUG_APIC
 //#define DEBUG_COALESCING
@@ -119,6 +120,7 @@  struct APICState {
     int wait_for_sipi;
 };
 
+static uint8_t vector_to_gsi_map[256] = { [0 ... 255] = 0xff };
 static APICState *local_apics[MAX_APICS + 1];
 static int apic_irq_delivered;
 
@@ -308,6 +310,15 @@  void apic_deliver_irq(uint8_t dest, uint8_t dest_mode,
                      trigger_mode);
 }
 
+void apic_deliver_ioapic_irq(uint8_t dest, uint8_t dest_mode,
+                             uint8_t delivery_mode, uint8_t vector_num,
+                             uint8_t polarity, uint8_t trigger_mode, int gsi)
+{
+    vector_to_gsi_map[vector_num] = gsi;
+    apic_deliver_irq(dest, dest_mode, delivery_mode,
+                     vector_num, polarity, trigger_mode);
+}
+
 void cpu_set_apic_base(DeviceState *d, uint64_t val)
 {
     APICState *s = DO_UPCAST(APICState, busdev.qdev, d);
@@ -432,8 +443,11 @@  static void apic_eoi(APICState *s)
     if (isrv < 0)
         return;
     reset_bit(s->isr, isrv);
-    /* XXX: send the EOI packet to the APIC bus to allow the I/O APIC to
-            set the remote IRR bit for level triggered interrupts. */
+
+    if (vector_to_gsi_map[isrv] != 0xff) {
+        ioapic_eoi(vector_to_gsi_map[isrv]);
+        vector_to_gsi_map[isrv] = 0xff;
+    }
     apic_update_irq(s);
 }
 
diff --git a/hw/apic.h b/hw/apic.h
index 8a0c9d0..59d0e37 100644
--- a/hw/apic.h
+++ b/hw/apic.h
@@ -8,6 +8,10 @@  void apic_deliver_irq(uint8_t dest, uint8_t dest_mode,
                              uint8_t delivery_mode,
                              uint8_t vector_num, uint8_t polarity,
                              uint8_t trigger_mode);
+void apic_deliver_ioapic_irq(uint8_t dest, uint8_t dest_mode,
+                             uint8_t delivery_mode,
+                             uint8_t vector_num, uint8_t polarity,
+                             uint8_t trigger_mode, int gsi);
 int apic_accept_pic_intr(DeviceState *s);
 void apic_deliver_pic_intr(DeviceState *s, int level);
 int apic_get_interrupt(DeviceState *s);
diff --git a/hw/ioapic.c b/hw/ioapic.c
index 5ae21e9..1e2fc2e 100644
--- a/hw/ioapic.c
+++ b/hw/ioapic.c
@@ -26,6 +26,7 @@ 
 #include "qemu-timer.h"
 #include "host-utils.h"
 #include "sysbus.h"
+#include "qlist.h"
 
 //#define DEBUG_IOAPIC
 
@@ -61,6 +62,30 @@  struct IOAPICState {
     uint64_t ioredtbl[IOAPIC_NUM_PINS];
 };
 
+static QLIST_HEAD(ioapic_eoi_client_list,
+                  ioapic_eoi_client) ioapic_eoi_client_list =
+                  QLIST_HEAD_INITIALIZER(ioapic_eoi_client_list);
+
+void ioapic_register_eoi_client(ioapic_eoi_client *client)
+{
+    QLIST_INSERT_HEAD(&ioapic_eoi_client_list, client, list);
+}
+
+void ioapic_unregister_eoi_client(ioapic_eoi_client *client)
+{
+    QLIST_REMOVE(client, list);
+}
+
+void ioapic_eoi(int gsi)
+{
+    ioapic_eoi_client *client;
+    QLIST_FOREACH(client, &ioapic_eoi_client_list, list) {
+        if (client->irq == gsi) {
+            client->eoi(client);
+        }
+    }
+}
+
 static void ioapic_service(IOAPICState *s)
 {
     uint8_t i;
@@ -90,8 +115,8 @@  static void ioapic_service(IOAPICState *s)
                 else
                     vector = entry & 0xff;
 
-                apic_deliver_irq(dest, dest_mode, delivery_mode,
-                                 vector, polarity, trig_mode);
+                apic_deliver_ioapic_irq(dest, dest_mode, delivery_mode,
+                                        vector, polarity, trig_mode, i);
             }
         }
     }
diff --git a/hw/pc.h b/hw/pc.h
index 63b0249..a88019f 100644
--- a/hw/pc.h
+++ b/hw/pc.h
@@ -48,8 +48,18 @@  typedef struct isa_irq_state {
 
 void isa_irq_handler(void *opaque, int n, int level);
 
-/* i8254.c */
+struct ioapic_eoi_client;
+typedef struct ioapic_eoi_client ioapic_eoi_client;
+struct ioapic_eoi_client {
+    void (*eoi)(struct ioapic_eoi_client *client);
+    int irq;
+    QLIST_ENTRY(ioapic_eoi_client) list;
+};
+void ioapic_register_eoi_client(ioapic_eoi_client *client);
+void ioapic_unregister_eoi_client(ioapic_eoi_client *client);
+void ioapic_eoi(int gsi);
 
+/* i8254.c */
 #define PIT_FREQ 1193182
 
 typedef struct PITState PITState;
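
For readers who want to see how a driver might use the interface added
in hw/pc.h, here is a minimal standalone sketch. It is not code from
this series: eoi_client mirrors ioapic_eoi_client but uses a plain
singly linked list instead of QLIST so that it compiles outside qemu,
and struct demo_dev with its callback is a hypothetical driver.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for the patch's ioapic_eoi_client. */
typedef struct eoi_client eoi_client;
struct eoi_client {
    void (*eoi)(eoi_client *client);
    int irq;
    eoi_client *next;
};

static eoi_client *eoi_clients;

static void eoi_register(eoi_client *c)
{
    c->next = eoi_clients;
    eoi_clients = c;
}

static void eoi_unregister(eoi_client *c)
{
    for (eoi_client **p = &eoi_clients; *p; p = &(*p)->next) {
        if (*p == c) {
            *p = c->next;
            c->next = NULL;
            return;
        }
    }
}

/* Counterpart of ioapic_eoi(): notify every client bound to this gsi. */
static void eoi_dispatch(int gsi)
{
    eoi_client *c, *next;

    for (c = eoi_clients; c; c = next) {
        next = c->next;     /* saved so a callback may unregister itself */
        if (c->irq == gsi) {
            c->eoi(c);
        }
    }
}

/* Hypothetical driver state with the client embedded as first member. */
struct demo_dev {
    eoi_client client;
    int eoi_count;
};

static void demo_dev_eoi(eoi_client *client)
{
    /* Valid because client is the first member of struct demo_dev. */
    struct demo_dev *dev = (struct demo_dev *)client;

    dev->eoi_count++;       /* a real driver would unmask INTx here */
}

static void eoi_client_demo(void)
{
    struct demo_dev dev = { .client = { .eoi = demo_dev_eoi, .irq = 11 } };

    eoi_register(&dev.client);
    eoi_dispatch(10);
    assert(dev.eoi_count == 0);     /* other gsi, not notified */
    eoi_dispatch(11);
    assert(dev.eoi_count == 1);     /* matching gsi, notified once */

    eoi_unregister(&dev.client);
    eoi_dispatch(11);
    assert(dev.eoi_count == 1);     /* no callbacks after unregister */
}
```

Embedding the client in the device structure lets the callback recover
its device without a separate opaque pointer, which matches the
single-argument eoi() callback in the patch.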