Patchwork [09/16] Enable message delivery via IRQs

login
register
mail settings
Submitter Jan Kiszka
Date June 6, 2010, 8:10 a.m.
Message ID <c875bf0470b4f0279e93d77c55d01bc50a12f1fc.1275811861.git.jan.kiszka@web.de>
Download mbox | patch
Permalink /patch/54805/
State New
Headers show

Comments

Jan Kiszka - June 6, 2010, 8:10 a.m.
From: Jan Kiszka <jan.kiszka@siemens.com>

This patch allows to optionally attach a message to an IRQ event. The
message can contain a payload reference and a callback that the IRQ
handler may invoke to report the delivery result. The former can be used
to model message signaling interrupts, the latter to cleanly implement
IRQ de-coalescing logics.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
---
 hw/irq.c |   37 ++++++++++++++++++++++++++++++++++++-
 hw/irq.h |   38 +++++++++++++++++++++++++++++++++++++-
 2 files changed, 73 insertions(+), 2 deletions(-)
Paul Brook - June 12, 2010, 12:21 p.m.
> This patch allows to optionally attach a message to an IRQ event. The
> message can contain a payload reference and a callback that the IRQ
> handler may invoke to report the delivery result. The former can be used
> to model message signaling interrupts, the latter to cleanly implement
> IRQ de-coalescing logics.

I don't like this. qemu_irq is a level triggered interface. Redundant calls to 
qemu_set_irq should (in principle) be a no-op.  If you want message passing 
then IMO you should be using something else.

Paul
Jan Kiszka - June 12, 2010, 12:32 p.m.
Paul Brook wrote:
>> This patch allows to optionally attach a message to an IRQ event. The
>> message can contain a payload reference and a callback that the IRQ
>> handler may invoke to report the delivery result. The former can be used
>> to model message signaling interrupts, the latter to cleanly implement
>> IRQ de-coalescing logics.
> 
> I don't like this. qemu_irq is a level triggered interface. Redundant calls to 
> qemu_set_irq should (in principle) be a no-op.  If you want message passing 
> then IMO you should be using something else.

I think we all now agree that we need some extension to the qemu_irq
interface to realize clean irq de-coalescing.

Returning a delivery code was proposed several times but always
rejected. Blueswirl brought up that message passing can model this as
well and that it has additional use for sparc. So I followed this
suggestion, specifically due to the added value and the absence of
concerns by other parties. Now, after spending yet another "few" hours
on this series, you raise a new concern. Wouldn't it be possible to do
this during the design discussion?

OK, what is your proposal now for an IRQ de-coalescing framework?

Jan
Blue Swirl - June 12, 2010, 1:44 p.m.
On Sat, Jun 12, 2010 at 12:21 PM, Paul Brook <paul@codesourcery.com> wrote:
>> This patch allows to optionally attach a message to an IRQ event. The
>> message can contain a payload reference and a callback that the IRQ
>> handler may invoke to report the delivery result. The former can be used
>> to model message signaling interrupts, the latter to cleanly implement
>> IRQ de-coalescing logics.
>
> I don't like this. qemu_irq is a level triggered interface. Redundant calls to
> qemu_set_irq should (in principle) be a no-op.  If you want message passing
> then IMO you should be using something else.

Keeping the optional message and qemu_irq together means that we can
reuse the existing IRQ subsystem. I'd guess something more separated
would need duplicate allocation and delivery support and maybe even
SysBus etc. would need lots of work to support a new class of IRQs.

Also detection of coalesced interrupts depends on reliably detecting
sort of redundant calls to qemu_set_irq. There we don't model real
level triggered hardware exactly but add some extra magic.
Paul Brook - June 12, 2010, 2:15 p.m.
> On Sat, Jun 12, 2010 at 12:21 PM, Paul Brook <paul@codesourcery.com> wrote:
> >> This patch allows to optionally attach a message to an IRQ event. The
> >> message can contain a payload reference and a callback that the IRQ
> >> handler may invoke to report the delivery result. The former can be used
> >> to model message signaling interrupts, the latter to cleanly implement
> >> IRQ de-coalescing logics.
> > 
> > I don't like this. qemu_irq is a level triggered interface. Redundant
> > calls to qemu_set_irq should (in principle) be a no-op.  If you want
> > message passing then IMO you should be using something else.
> 
> Keeping the optional message and qemu_irq together means that we can
> reuse the existing IRQ subsystem. I'd guess something more separated
> would need duplicate allocation and delivery support and maybe even
> SysBus etc. would need lots of work to support a new class of IRQs.

How do you propose message passing is handled when you have nested multi-layer 
interrupt trees? How long is the message data valid for? Who owns it? How is a 
receiver meant to know for format of the message being delivered, and who it's 
intended for?

IMO message triggered systems are fundamentally different to level states. 
qemu_irq represents a level state, and I'd really like for it to stay that 
way.  

If we need/want a generic message passing interface, then that's a different 
problem, and needs to be done in such a way that the devices always agree on 
the type of message being passed.

TBH I preferred the original system whereby the source can query the state of 
the sink (i.e "are you ignoring this line?").  Note that conceptually this 
should be *querying* state, not responding to an event.

Paul
Blue Swirl - June 12, 2010, 2:35 p.m.
On Sat, Jun 12, 2010 at 2:15 PM, Paul Brook <paul@codesourcery.com> wrote:
>> On Sat, Jun 12, 2010 at 12:21 PM, Paul Brook <paul@codesourcery.com> wrote:
>> >> This patch allows to optionally attach a message to an IRQ event. The
>> >> message can contain a payload reference and a callback that the IRQ
>> >> handler may invoke to report the delivery result. The former can be used
>> >> to model message signaling interrupts, the latter to cleanly implement
>> >> IRQ de-coalescing logics.
>> >
>> > I don't like this. qemu_irq is a level triggered interface. Redundant
>> > calls to qemu_set_irq should (in principle) be a no-op.  If you want
>> > message passing then IMO you should be using something else.
>>
>> Keeping the optional message and qemu_irq together means that we can
>> reuse the existing IRQ subsystem. I'd guess something more separated
>> would need duplicate allocation and delivery support and maybe even
>> SysBus etc. would need lots of work to support a new class of IRQs.
>
> How do you propose message passing is handled when you have nested multi-layer
> interrupt trees?

Do we have such trees somewhere? I think message passing interrupts
are only used in bus based systems, like PCI or UPA/JBUS etc. I don't
know how LAPIC/IOAPIC bus works, it could be similar.

> How long is the message data valid for?

Currently it's valid for the duration of the call. It may make sense
to extend that later in order to handle EOIs.

> Who owns it?

The source.

> How is a
> receiver meant to know for format of the message being delivered, and who it's
> intended for?

This depends on the architecture. On Sparc64 the message is specified
by the guest or OF.

> IMO message triggered systems are fundamentally different to level states.
> qemu_irq represents a level state, and I'd really like for it to stay that
> way.

In real HW (MSI or Sparc64 mondo interrupts) a bus delivers the
message. But our buses only handle address and data lines.

> If we need/want a generic message passing interface, then that's a different
> problem, and needs to be done in such a way that the devices always agree on
> the type of message being passed.
>
> TBH I preferred the original system whereby the source can query the state of
> the sink (i.e "are you ignoring this line?").  Note that conceptually this
> should be *querying* state, not responding to an event.

It's simple, but I think too simple. It would work for the coalescing
interrupt hack but useless for other things.
Paul Brook - June 12, 2010, 3:58 p.m.
> I think message passing interrupts
> are only used in bus based systems, like PCI or UPA/JBUS etc. I don't
> know how LAPIC/IOAPIC bus works, it could be similar.

PCI Message Signalled Interrupts use a regular data write, and we model it 
exactly that way. Under normal circumstances you program the device to write 
to memory mapped APIC control registers, but there's no real reason why you 
couldn't write to RAM.

LAPIC/IOAPIC have their own bus. On some hardware this is multiplexed over the 
main system bus, but logically it is a separate interconnect.

The fact that these a bus based systems suggests against using qemu_irq, which 
is an inherently point-point system.

> > How is a
> > receiver meant to know for format of the message being delivered, and who
> > it's intended for?
> 
> This depends on the architecture. On Sparc64 the message is specified
> by the guest or OF.
>...
> In real HW (MSI or Sparc64 mondo interrupts) a bus delivers the
> message. But our buses only handle address and data lines.

IIUC you're trying to use qemu_irq as a generic interconnect between devices. 
I think that's going to cause more problems than it solves.  If a bus has 
additional message passing capabilities then this can be part of the bus 
interface.

If two devices need a private communication channel then we probably want some 
implementation agnostic way of defining this link.  Connecting LAPIC to IOAPIC 
is a potential example of this. However this needs to be done in a type-safe 
manner.  DMA channels may be another potential use of this linking. We'd want 
to avoid connecting a DMA channel to an APIC.

[*] A simple unidirectional dma request line is suitable for qmu_irq. A DMA 
system that transfers data outside of memory read/write transactions is not. 
e.g. ISA effectively defines a regular memory bus plus 8 bidirectional data 
streams (aka DMA channels). These are multiplexed over the same physical pins, 
but logically distinct. The DMA channels could either be bundled up into the 
ISA bus interface, or modelled as links between devices and the DMA 
controller.

> > TBH I preferred the original system whereby the source can query the
> > state of the sink (i.e "are you ignoring this line?").  Note that
> > conceptually this should be querying state, not responding to an event.
> 
> It's simple, but I think too simple. It would work for the coalescing
> interrupt hack but useless for other things.

I'm not convinced making qemu_irq do "other things" is a good idea.

Paul
Blue Swirl - June 12, 2010, 7:33 p.m.
On Sat, Jun 12, 2010 at 3:58 PM, Paul Brook <paul@codesourcery.com> wrote:
>> I think message passing interrupts
>> are only used in bus based systems, like PCI or UPA/JBUS etc. I don't
>> know how LAPIC/IOAPIC bus works, it could be similar.
>
> PCI Message Signalled Interrupts use a regular data write, and we model it
> exactly that way. Under normal circumstances you program the device to write
> to memory mapped APIC control registers, but there's no real reason why you
> couldn't write to RAM.
>
> LAPIC/IOAPIC have their own bus. On some hardware this is multiplexed over the
> main system bus, but logically it is a separate interconnect.
>
> The fact that these a bus based systems suggests against using qemu_irq, which
> is an inherently point-point system.
>
>> > How is a
>> > receiver meant to know for format of the message being delivered, and who
>> > it's intended for?
>>
>> This depends on the architecture. On Sparc64 the message is specified
>> by the guest or OF.
>>...
>> In real HW (MSI or Sparc64 mondo interrupts) a bus delivers the
>> message. But our buses only handle address and data lines.
>
> IIUC you're trying to use qemu_irq as a generic interconnect between devices.
> I think that's going to cause more problems than it solves.  If a bus has
> additional message passing capabilities then this can be part of the bus
> interface.
>
> If two devices need a private communication channel then we probably want some
> implementation agnostic way of defining this link.  Connecting LAPIC to IOAPIC
> is a potential example of this. However this needs to be done in a type-safe
> manner.  DMA channels may be another potential use of this linking. We'd want
> to avoid connecting a DMA channel to an APIC.
>
> [*] A simple unidirectional dma request line is suitable for qmu_irq. A DMA
> system that transfers data outside of memory read/write transactions is not.
> e.g. ISA effectively defines a regular memory bus plus 8 bidirectional data
> streams (aka DMA channels). These are multiplexed over the same physical pins,
> but logically distinct. The DMA channels could either be bundled up into the
> ISA bus interface, or modelled as links between devices and the DMA
> controller.

Very very interesting. There's some out of band data related to DMA
(bus errors, IOMMU faults, ISA DREQ) which hasn't been covered in
previous Generic DMA proposals. Maybe all we need is a generic side
band link, which would be usable for point to point (qemu_irq,
coalescing, messages) and bus channels?

>> > TBH I preferred the original system whereby the source can query the
>> > state of the sink (i.e "are you ignoring this line?").  Note that
>> > conceptually this should be querying state, not responding to an event.
>>
>> It's simple, but I think too simple. It would work for the coalescing
>> interrupt hack but useless for other things.
>
> I'm not convinced making qemu_irq do "other things" is a good idea.
Paul Brook - June 12, 2010, 8:15 p.m.
> > [*] A simple unidirectional dma request line is suitable for qmu_irq. A
> > DMA system that transfers data outside of memory read/write transactions
> > is not. e.g. ISA effectively defines a regular memory bus plus 8
> > bidirectional data streams (aka DMA channels). These are multiplexed
> > over the same physical pins, but logically distinct. The DMA channels
> > could either be bundled up into the ISA bus interface, or modelled as
> > links between devices and the DMA controller.
> 
> Very very interesting. There's some out of band data related to DMA
> (bus errors, IOMMU faults, ISA DREQ) which hasn't been covered in
> previous Generic DMA proposals.

I suspect you're confusing two different concepts - bus-master DMA (memory 
accesses generated by devices) v.s. IO channel DMA (out of band data stream 
between a device and a third party DMA engine). Despite both being called 
"DMA", they significantly different beasts. The IO channel DMA master is 
effectively a bridge between the memory bus (accessed via bus-master DMA) and 
the device IO channel.

Note that actual IO channel DMA is relatively rare. While many embedded 
devices still have a DMA engine this tends to generate regular memory 
accesses, which are then programmed to access a memory mapped FIFO. This 
shouldn't need any special magic to implement, typically just a simple 
qemu_irq control line.

> Maybe all we need is a generic side
> band link, which would be usable for point to point (qemu_irq,
> coalescing, messages) and bus channels?

We may want to consider a common mechanism for mapping links between devices. 
i.e some way of saying "Connect device X port N to device Y port M". In the 
case of traditional interrupts each IRQ source or sink would be a "port". For 
a bus system you'd probably want to connect them all to a [virtual] bus 
arbiter object.

This should allow the mapping core to verify that e.g. you're not connecting a 
qemu_irq to something expecting an apic_message, preferably without knowing or 
caring how either of those work.

In theory I guess you could also use the same mechanism to connect network 
devices to their host interface, etc.  I haven't thought this bit through 
fully, so this may be excessively clever.

Paul
Blue Swirl - June 12, 2010, 8:32 p.m.
On Sat, Jun 12, 2010 at 7:33 PM, Blue Swirl <blauwirbel@gmail.com> wrote:
> On Sat, Jun 12, 2010 at 3:58 PM, Paul Brook <paul@codesourcery.com> wrote:
>>> I think message passing interrupts
>>> are only used in bus based systems, like PCI or UPA/JBUS etc. I don't
>>> know how LAPIC/IOAPIC bus works, it could be similar.
>>
>> PCI Message Signalled Interrupts use a regular data write, and we model it
>> exactly that way. Under normal circumstances you program the device to write
>> to memory mapped APIC control registers, but there's no real reason why you
>> couldn't write to RAM.
>>
>> LAPIC/IOAPIC have their own bus. On some hardware this is multiplexed over the
>> main system bus, but logically it is a separate interconnect.
>>
>> The fact that these a bus based systems suggests against using qemu_irq, which
>> is an inherently point-point system.
>>
>>> > How is a
>>> > receiver meant to know for format of the message being delivered, and who
>>> > it's intended for?
>>>
>>> This depends on the architecture. On Sparc64 the message is specified
>>> by the guest or OF.
>>>...
>>> In real HW (MSI or Sparc64 mondo interrupts) a bus delivers the
>>> message. But our buses only handle address and data lines.
>>
>> IIUC you're trying to use qemu_irq as a generic interconnect between devices.
>> I think that's going to cause more problems than it solves.  If a bus has
>> additional message passing capabilities then this can be part of the bus
>> interface.
>>
>> If two devices need a private communication channel then we probably want some
>> implementation agnostic way of defining this link.  Connecting LAPIC to IOAPIC
>> is a potential example of this. However this needs to be done in a type-safe
>> manner.  DMA channels may be another potential use of this linking. We'd want
>> to avoid connecting a DMA channel to an APIC.
>>
>> [*] A simple unidirectional dma request line is suitable for qmu_irq. A DMA
>> system that transfers data outside of memory read/write transactions is not.
>> e.g. ISA effectively defines a regular memory bus plus 8 bidirectional data
>> streams (aka DMA channels). These are multiplexed over the same physical pins,
>> but logically distinct. The DMA channels could either be bundled up into the
>> ISA bus interface, or modelled as links between devices and the DMA
>> controller.
>
> Very very interesting. There's some out of band data related to DMA
> (bus errors, IOMMU faults, ISA DREQ) which hasn't been covered in
> previous Generic DMA proposals. Maybe all we need is a generic side
> band link, which would be usable for point to point (qemu_irq,
> coalescing, messages) and bus channels?

I think we could solve all problems (well, maybe not world peace, yet)
by switching to message based system for all of DMA and IRQs.

Each device would have a message input port and way to output messages.

Examples:

Zero copy memory access from device D1 to D2 to host memory (D3) with
access broken to page length units and errors occurring on the last
byte:
D1 send_msg(ID, MSG_MEM_WRITE, DMA address, length) -> D2
D2 send_msg(ID, MSG_MEM_WRITE, DMA address2, length) -> D3
D3 send_replymsg(ID, MSG_MEM_PTR, host address, 4096) -> D2
D2 send_replymsg(ID, MSG_MEM_PTR, host address, 4096) -> D1
D3 send_replymsg(ID, MSG_MEM_PTR, host address, length - 4096 - 1) -> D2
D2 send_replymsg(ID, MSG_MEM_PTR, host address, length - 4096 - 1) -> D1
D3 send_replymsg(ID, MSG_MEM_ERROR, DMA address2 + length - 1, 1, status) -> D2
D2 send_replymsg(ID, MSG_MEM_ERROR, DMA address + length - 1, 1, status) -> D1

IRQ delivery chain D1->D2->D3 with coalescing, messages, delivery
reporting and EOI:
D1 send_msg(ID, MSG_IRQ_RAISE, payload) -> D2
D2 send_msg(ID, MSG_IRQ_RAISE, payload) -> D3
D3 send_replymsg(ID, MSG_IRQ_STATUS, MI_COALESCED) -> D2
D2 send_replymsg(ID, MSG_IRQ_STATUS, MI_COALESCED) -> D1
D3 send_replymsg(ID, MSG_IRQ_STATUS, MI_DELIVERED) -> D2
D2 send_replymsg(ID, MSG_IRQ_STATUS, MI_DELIVERED) -> D1
D3 send_replymsg(ID, MSG_IRQ_STATUS, MI_EOI) -> D2
D2 send_replymsg(ID, MSG_IRQ_STATUS, MI_EOI) -> D1

What else do we want? :-)
Blue Swirl - June 13, 2010, 6:47 a.m.
On Sat, Jun 12, 2010 at 8:32 PM, Blue Swirl <blauwirbel@gmail.com> wrote:
> On Sat, Jun 12, 2010 at 7:33 PM, Blue Swirl <blauwirbel@gmail.com> wrote:
>> On Sat, Jun 12, 2010 at 3:58 PM, Paul Brook <paul@codesourcery.com> wrote:
>>>> I think message passing interrupts
>>>> are only used in bus based systems, like PCI or UPA/JBUS etc. I don't
>>>> know how LAPIC/IOAPIC bus works, it could be similar.
>>>
>>> PCI Message Signalled Interrupts use a regular data write, and we model it
>>> exactly that way. Under normal circumstances you program the device to write
>>> to memory mapped APIC control registers, but there's no real reason why you
>>> couldn't write to RAM.
>>>
>>> LAPIC/IOAPIC have their own bus. On some hardware this is multiplexed over the
>>> main system bus, but logically it is a separate interconnect.
>>>
>>> The fact that these a bus based systems suggests against using qemu_irq, which
>>> is an inherently point-point system.
>>>
>>>> > How is a
>>>> > receiver meant to know for format of the message being delivered, and who
>>>> > it's intended for?
>>>>
>>>> This depends on the architecture. On Sparc64 the message is specified
>>>> by the guest or OF.
>>>>...
>>>> In real HW (MSI or Sparc64 mondo interrupts) a bus delivers the
>>>> message. But our buses only handle address and data lines.
>>>
>>> IIUC you're trying to use qemu_irq as a generic interconnect between devices.
>>> I think that's going to cause more problems than it solves.  If a bus has
>>> additional message passing capabilities then this can be part of the bus
>>> interface.
>>>
>>> If two devices need a private communication channel then we probably want some
>>> implementation agnostic way of defining this link.  Connecting LAPIC to IOAPIC
>>> is a potential example of this. However this needs to be done in a type-safe
>>> manner.  DMA channels may be another potential use of this linking. We'd want
>>> to avoid connecting a DMA channel to an APIC.
>>>
>>> [*] A simple unidirectional dma request line is suitable for qmu_irq. A DMA
>>> system that transfers data outside of memory read/write transactions is not.
>>> e.g. ISA effectively defines a regular memory bus plus 8 bidirectional data
>>> streams (aka DMA channels). These are multiplexed over the same physical pins,
>>> but logically distinct. The DMA channels could either be bundled up into the
>>> ISA bus interface, or modelled as links between devices and the DMA
>>> controller.
>>
>> Very very interesting. There's some out of band data related to DMA
>> (bus errors, IOMMU faults, ISA DREQ) which hasn't been covered in
>> previous Generic DMA proposals. Maybe all we need is a generic side
>> band link, which would be usable for point to point (qemu_irq,
>> coalescing, messages) and bus channels?
>
> I think we could solve all problems (well, maybe not world peace, yet)
> by switching to message based system for all of DMA and IRQs.
>
> Each device would have a message input port and way to output messages.
>
> Examples:
>
> Zero copy memory access from device D1 to D2 to host memory (D3) with
> access broken to page length units and errors occurring on the last
> byte:
> D1 send_msg(ID, MSG_MEM_WRITE, DMA address, length) -> D2
> D2 send_msg(ID, MSG_MEM_WRITE, DMA address2, length) -> D3
> D3 send_replymsg(ID, MSG_MEM_PTR, host address, 4096) -> D2
> D2 send_replymsg(ID, MSG_MEM_PTR, host address, 4096) -> D1
> D3 send_replymsg(ID, MSG_MEM_PTR, host address, length - 4096 - 1) -> D2
> D2 send_replymsg(ID, MSG_MEM_PTR, host address, length - 4096 - 1) -> D1
> D3 send_replymsg(ID, MSG_MEM_ERROR, DMA address2 + length - 1, 1, status) -> D2
> D2 send_replymsg(ID, MSG_MEM_ERROR, DMA address + length - 1, 1, status) -> D1

This could be implemented with small modifications to existing
cpu_physical_memory_rw() interface. Improved devices would check if a
channel back has been registered. If not, current methods are used,
same with unimproved devices. If yes, translated pointers will flow
back for zero copy DMA.

> IRQ delivery chain D1->D2->D3 with coalescing, messages, delivery
> reporting and EOI:
> D1 send_msg(ID, MSG_IRQ_RAISE, payload) -> D2
> D2 send_msg(ID, MSG_IRQ_RAISE, payload) -> D3
> D3 send_replymsg(ID, MSG_IRQ_STATUS, MI_COALESCED) -> D2
> D2 send_replymsg(ID, MSG_IRQ_STATUS, MI_COALESCED) -> D1
> D3 send_replymsg(ID, MSG_IRQ_STATUS, MI_DELIVERED) -> D2
> D2 send_replymsg(ID, MSG_IRQ_STATUS, MI_DELIVERED) -> D1
> D3 send_replymsg(ID, MSG_IRQ_STATUS, MI_EOI) -> D2
> D2 send_replymsg(ID, MSG_IRQ_STATUS, MI_EOI) -> D1

Same here: send_msg could be in fact be qemu_irq_raise(). Without a
return chain, no replies will be sent. For messages, an augmented
forward chain must exist, otherwise the messages will be ignored.
Paul Brook - June 13, 2010, 3:49 p.m.
> I think we could solve all problems (well, maybe not world peace, yet)
> by switching to message based system for all of DMA and IRQs.
> 
> Each device would have a message input port and way to output messages.
> 
> Examples:
> 
> Zero copy memory access from device D1 to D2 to host memory (D3) with
> access broken to page length units and errors occurring on the last
> byte:
> D1 send_msg(ID, MSG_MEM_WRITE, DMA address, length) -> D2
>...
> IRQ delivery chain D1->D2->D3 with coalescing, messages, delivery
> reporting and EOI:
> D1 send_msg(ID, MSG_IRQ_RAISE, payload) -> D2

This feels like a terrible idea to me. It introduces an unnecessary RPC 
indirection layer without actually solving any of the problems. It just makes 
it harder (if not impossible) for the compiler to verify any of the interfaces 
between objects.

If we need a mechanism for a device to query the state of its outgoing IRQ 
lines (which is what I mentioned previously), then we can just add that. I 
still maintain that qemu_irq is *not* a message passing interface, and that 
this absence is a good thing. Maybe if I make the qemu_irq code actively 
suppress redundant level change notices that would make things clearer.

Paul
Paul Brook - June 13, 2010, 4:34 p.m.
> TBH I preferred the original system whereby the source can query the state
> of the sink (i.e "are you ignoring this line?").  Note that conceptually
> this should be *querying* state, not responding to an event.

People are still pushing qemu_irq as an message passing interface, so I'm 
going to expand a bit more on how I think this could be avoided.

Start with the assumption that qemu irq represents a single bit of 
information. The current implementation is stateless, but in principle it 
could remember its state and ignore redundant calls to qemu_set_irq.

In order to workaround the periodic timer issue, we need some way of for the 
source device to interrogate the target device state relating to this link. 
The suggestions so far are along the lines of "what happened when I made this 
change". This makes me unhappy, because it's overlaying event semantics on top 
of a state based system.

Instead I suggest that we should be describing what the target state 
associated with this input is.  Suitable return values could be:
 * normal: This input effects the state of the target device. Note that this 
need not imply that changing the input actually effects the output at this 
time. e.g. if an interrupt controller is already processing a higher priority 
input, the low priority inputs should still return "normal" - the input will 
be processed once the unrelated high priority input is finished.
 * latched: This input has already effected target device state, and will be 
ignored until reset by some external event. Typically means an interrupt 
controller latches its inputs, and this input has already been latched.
 * masked: This input is ignored.

In practice these should give approximately the same information as event 
based delivered/coalesced/dropped responses.  The difference is that they are 
consistent with the state based nature of qemu_irq.

Paul
Blue Swirl - June 13, 2010, 6:04 p.m.
On Sun, Jun 13, 2010 at 4:34 PM, Paul Brook <paul@codesourcery.com> wrote:
>> TBH I preferred the original system whereby the source can query the state
>> of the sink (i.e "are you ignoring this line?").  Note that conceptually
>> this should be *querying* state, not responding to an event.
>
> People are still pushing qemu_irq as an message passing interface, so I'm
> going to expand a bit more on how I think this could be avoided.
>
> Start with the assumption that qemu irq represents a single bit of
> information. The current implementation is stateless, but in principle it
> could remember its state and ignore redundant calls to qemu_set_irq.
>
> In order to workaround the periodic timer issue, we need some way of for the
> source device to interrogate the target device state relating to this link.
> The suggestions so far are along the lines of "what happened when I made this
> change". This makes me unhappy, because it's overlaying event semantics on top
> of a state based system.
>
> Instead I suggest that we should be describing what the target state
> associated with this input is.  Suitable return values could be:
>  * normal: This input effects the state of the target device. Note that this
> need not imply that changing the input actually effects the output at this
> time. e.g. if an interrupt controller is already processing a higher priority
> input, the low priority inputs should still return "normal" - the input will
> be processed once the unrelated high priority input is finished.
>  * latched: This input has already effected target device state, and will be
> ignored until reset by some external event. Typically means an interrupt
> controller latches its inputs, and this input has already been latched.
>  * masked: This input is ignored.
>
> In practice these should give approximately the same information as event
> based delivered/coalesced/dropped responses.  The difference is that they are
> consistent with the state based nature of qemu_irq.

I agree this would be enough for the current RTC/APIC case.

This doesn't help in EOI handling or message passing in any way, but
we can make completely separate systems for those later when the need
arises.
Blue Swirl - June 13, 2010, 6:17 p.m.
On Sun, Jun 13, 2010 at 3:49 PM, Paul Brook <paul@codesourcery.com> wrote:
>> I think we could solve all problems (well, maybe not world peace, yet)
>> by switching to message based system for all of DMA and IRQs.
>>
>> Each device would have a message input port and way to output messages.
>>
>> Examples:
>>
>> Zero copy memory access from device D1 to D2 to host memory (D3) with
>> access broken to page length units and errors occurring on the last
>> byte:
>> D1 send_msg(ID, MSG_MEM_WRITE, DMA address, length) -> D2
>>...
>> IRQ delivery chain D1->D2->D3 with coalescing, messages, delivery
>> reporting and EOI:
>> D1 send_msg(ID, MSG_IRQ_RAISE, payload) -> D2
>
> This feels like a terrible idea to me. It introduces an unnecessary RPC
> indirection layer without actually solving any of the problems. It just makes
> it harder (if not impossible) for the compiler to verify any of the interfaces
> between objects.

For the memory access case, in practice the interface could be
sysbus_memory_rw(DeviceState *parent, target_phys_addr_t addr,
target_phys_addr_t size)
in place of send_msg() and
sysbus_memory_rw_cb(DeviceState *dev, void *ptr, size_t size, int status)
in place of send_replymsg() so we'd have compiler type checks.
Paul Brook - June 13, 2010, 6:39 p.m.
> On Sun, Jun 13, 2010 at 3:49 PM, Paul Brook <paul@codesourcery.com> wrote:
> >> I think we could solve all problems (well, maybe not world peace, yet)
> >> by switching to message based system for all of DMA and IRQs.
> >> 
> >> Each device would have a message input port and way to output messages.
> >> 
> >> Examples:
> >> 
> >> Zero copy memory access from device D1 to D2 to host memory (D3) with
> >> access broken to page length units and errors occurring on the last
> >> byte:
> >> D1 send_msg(ID, MSG_MEM_WRITE, DMA address, length) -> D2
> >>
> >>...
> >>
> >> IRQ delivery chain D1->D2->D3 with coalescing, messages, delivery
> >> reporting and EOI:
> >> D1 send_msg(ID, MSG_IRQ_RAISE, payload) -> D2
> > 
> > This feels like a terrible idea to me. It introduces an unnecessary RPC
> > indirection layer without actually solving any of the problems. It just
> > makes it harder (if not impossible) for the compiler to verify any of
> > the interfaces between objects.
> 
> For the memory access case, in practice the interface could be
> sysbus_memory_rw(DeviceState *parent, target_phys_addr_t addr,
> target_phys_addr_t size)

Why "parent"?

> in place of send_msg() and
> sysbus_memory_rw_cb(DeviceState *dev, void *ptr, size_t size, int status)
> in place of send_replymsg() so we'd have compiler type checks.

I don't see any point point trying to squeeze this through a common message 
passing API. We *could* do that if we really wanted, but It's a lot of hassle. 
If devices are going to end up using wrappers that look a lot like a straight 
API then what's the point?

Paul
Blue Swirl - June 13, 2010, 6:54 p.m.
On Sun, Jun 13, 2010 at 6:39 PM, Paul Brook <paul@codesourcery.com> wrote:
>> On Sun, Jun 13, 2010 at 3:49 PM, Paul Brook <paul@codesourcery.com> wrote:
>> >> I think we could solve all problems (well, maybe not world peace, yet)
>> >> by switching to message based system for all of DMA and IRQs.
>> >>
>> >> Each device would have a message input port and way to output messages.
>> >>
>> >> Examples:
>> >>
>> >> Zero copy memory access from device D1 to D2 to host memory (D3) with
>> >> access broken to page length units and errors occurring on the last
>> >> byte:
>> >> D1 send_msg(ID, MSG_MEM_WRITE, DMA address, length) -> D2
>> >>
>> >>...
>> >>
>> >> IRQ delivery chain D1->D2->D3 with coalescing, messages, delivery
>> >> reporting and EOI:
>> >> D1 send_msg(ID, MSG_IRQ_RAISE, payload) -> D2
>> >
>> > This feels like a terrible idea to me. It introduces an unnecessary RPC
>> > indirection layer without actually solving any of the problems. It just
>> > makes it harder (if not impossible) for the compiler to verify any of
>> > the interfaces between objects.
>>
>> For the memory access case, in practice the interface could be
>> sysbus_memory_rw(DeviceState *parent, target_phys_addr_t addr,
>> target_phys_addr_t size)
>
> Why "parent"?

Parent device or bus host bridge device. Alternatively there could be
a bus handle.

>> in place of send_msg() and
>> sysbus_memory_rw_cb(DeviceState *dev, void *ptr, size_t size, int status)
>> in place of send_replymsg() so we'd have compiler type checks.
>
> I don't see any point point trying to squeeze this through a common message
> passing API. We *could* do that if we really wanted, but It's a lot of hassle.
> If devices are going to end up using wrappers that look a lot like a straight
> API then what's the point?

Without IRQ and other messages, I agree that at this point it would
not make much sense. Do you think this API as straight implementation
would work? Compared to previous attempts for Generic DMA, we'd avoid
any IO vector splitting/reallocations etc.
Paul Brook - June 13, 2010, 7:38 p.m.
> >> For the memory access case, in practice the interface could be
> >> sysbus_memory_rw(DeviceState *parent, target_phys_addr_t addr,
> >> target_phys_addr_t size)
> > 
> > Why "parent"?
> 
> Parent device or bus host bridge device. Alternatively there could be
> a bus handle.

A device has no way of knowing where that bust host bridge is, and this may be 
the wrong device anyway. The access may be directed at another device on the 
same bus.  A bus handle is trivial to derive this from the device itself.

> > I don't see any point point trying to squeeze this through a common
> > message passing API. We *could* do that if we really wanted, but It's a
> > lot of hassle. If devices are going to end up using wrappers that look a
> > lot like a straight API then what's the point?
> 
> Without IRQ and other messages, I agree that at this point it would
> not make much sense. Do you think this API as straight implementation
> would work? Compared to previous attempts for Generic DMA, we'd avoid
> any IO vector splitting/reallocations etc.

We have already merged a "generic DMA" infrastructure 
(cpu_physical_memory_map). See also dma-helpers.c for code that uses this to 
implement zero-copy transfers to/from block devices.

This is missing code to allow manipulation by intermediate bus/bridges, but I 
believe the basic API is fairly sound. Addition of a device handle or error 
reporting should be possible without requiring major changes to the device 
code.

Paul
Gleb Natapov - June 14, 2010, 5:40 a.m.
On Sun, Jun 13, 2010 at 05:34:24PM +0100, Paul Brook wrote:
> > TBH I preferred the original system whereby the source can query the state
> > of the sink (i.e "are you ignoring this line?").  Note that conceptually
> > this should be *querying* state, not responding to an event.
> 
> People are still pushing qemu_irq as an message passing interface, so I'm 
> going to expand a bit more on how I think this could be avoided.
> 
Please, please, please write code and show how it solves existing
problems. Code talks much better then words.

> Start with the assumption that qemu irq represents a single bit of 
> information. The current implementation is stateless, but in principle it 
> could remember its state and ignore redundant calls to qemu_set_irq.
> 
And I don't see why it will affect proposed implementation. Device
should not call qemu_set_irq() with the same state twice anyway. Some
device's implementations are lazy so they don't track previous state, but
this is just implementation detail. 

> In order to workaround the periodic timer issue, we need some way of for the 
> source device to interrogate the target device state relating to this link. 
> The suggestions so far are along the lines of "what happened when I made this 
> change". This makes me unhappy, because it's overlaying event semantics on top 
> of a state based system.
> 
> Instead I suggest that we should be describing what the target state 
> associated with this input is.  Suitable return values could be:
>  * normal: This input effects the state of the target device. Note that this 
> need not imply that changing the input actually effects the output at this 
> time. e.g. if an interrupt controller is already processing a higher priority 
> input, the low priority inputs should still return "normal" - the input will 
> be processed once the unrelated high priority input is finished.
>  * latched: This input has already effected target device state, and will be 
> ignored until reset by some external event. Typically means an interrupt 
> controller latches its inputs, and this input has already been latched.
>  * masked: This input is ignored.
> 
> In practice these should give approximately the same information as event 
> based delivered/coalesced/dropped responses.  The difference is that they are 
> consistent with the state based nature of qemu_irq.
> 
You just described what was proposed by Jan, you just not agree that it
should be done on top of qemu_irq without proposing any alternatives.
The problem is that device doesn't know what its target is and it can have
multiple targets (that is why hack that exist today works only for one
particular configuration as Blue Swirl found out the hard way). qemu_irq
traverses the dynamic chain of links that connects source and its destinations,
and thus device itself doesn't need to know what its destinations are.

--
			Gleb.

Patch

diff --git a/hw/irq.c b/hw/irq.c
index 24fb09d..db5136b 100644
--- a/hw/irq.c
+++ b/hw/irq.c
@@ -28,12 +28,31 @@  struct IRQState {
     qemu_irq_handler handler;
     void *opaque;
     int n;
+    IRQMsg *msg;
 };
 
-void qemu_set_irq(qemu_irq irq, int level)
+void qemu_set_irq_msg(qemu_irq irq, int level, IRQMsg *msg)
 {
     if (irq) {
+        irq->msg = msg;
         irq->handler(irq, irq->opaque, irq->n, level);
+        irq->msg = NULL;
+    }
+}
+
+void *qemu_irq_get_payload(qemu_irq irq)
+{
+    IRQMsg *msg = irq->msg;
+
+    return msg ? msg->payload : NULL;
+}
+
+void qemu_irq_fire_delivery_cb(qemu_irq irq, int level, int result)
+{
+    IRQMsg *msg = irq->msg;
+
+    if (msg && msg->delivery_cb) {
+        msg->delivery_cb(irq, msg->delivery_opaque, irq->n, level, result);
     }
 }
 
@@ -61,11 +80,27 @@  void qemu_free_irqs(qemu_irq *s)
     qemu_free(s);
 }
 
+static void qemu_notirq_delivery_cb(qemu_irq irq, void *opaque, int line,
+                                    int level, int result)
+{
+    qemu_irq orig_irq = opaque;
+
+    qemu_irq_fire_delivery_cb(orig_irq, !level, result);
+}
+
 static void qemu_notirq(qemu_irq irq, void *opaque, int line, int level)
 {
     struct IRQState *inv_irq = opaque;
+    IRQMsg msg;
 
+    if (irq->msg) {
+        msg.delivery_cb = qemu_notirq_delivery_cb;
+        msg.delivery_opaque = irq;
+        msg.payload = irq->msg->payload;
+        inv_irq->msg = &msg;
+    }
     inv_irq->handler(inv_irq, inv_irq->opaque, inv_irq->n, !level);
+    inv_irq->msg = NULL;
 }
 
 qemu_irq qemu_irq_invert(qemu_irq irq)
diff --git a/hw/irq.h b/hw/irq.h
index d0f83e3..01f96af 100644
--- a/hw/irq.h
+++ b/hw/irq.h
@@ -3,26 +3,62 @@ 
 
 /* Generic IRQ/GPIO pin infrastructure.  */
 
+#define QEMU_IRQ_DELIVERED      0
+#define QEMU_IRQ_COALESCED      (-1)
+#define QEMU_IRQ_MASKED         (-2)
+
+typedef void (*qemu_irq_delivery_cb)(qemu_irq irq, void *opaque, int n,
+                                     int level, int result);
 typedef void (*qemu_irq_handler)(qemu_irq irq, void *opaque, int n, int level);
 
-void qemu_set_irq(qemu_irq irq, int level);
+typedef struct IRQMsg {
+    qemu_irq_delivery_cb delivery_cb;
+    void *delivery_opaque;
+    void *payload;
+} IRQMsg;
+
+void qemu_set_irq_msg(qemu_irq irq, int level, IRQMsg *msg);
+
+static inline void qemu_set_irq(qemu_irq irq, int level)
+{
+    qemu_set_irq_msg(irq, level, NULL);
+}
 
 static inline void qemu_irq_raise(qemu_irq irq)
 {
     qemu_set_irq(irq, 1);
 }
 
+static inline void qemu_irq_raise_msg(qemu_irq irq, IRQMsg *msg)
+{
+    qemu_set_irq_msg(irq, 1, msg);
+}
+
 static inline void qemu_irq_lower(qemu_irq irq)
 {
     qemu_set_irq(irq, 0);
 }
 
+static inline void qemu_irq_lower_msg(qemu_irq irq, IRQMsg *msg)
+{
+    qemu_set_irq_msg(irq, 0, msg);
+}
+
 static inline void qemu_irq_pulse(qemu_irq irq)
 {
     qemu_set_irq(irq, 1);
     qemu_set_irq(irq, 0);
 }
 
+static inline void qemu_irq_pulse_msg(qemu_irq irq, IRQMsg *msg)
+{
+    qemu_set_irq_msg(irq, 1, msg);
+    qemu_set_irq_msg(irq, 0, msg);
+}
+
+void qemu_irq_fire_delivery_cb(qemu_irq irq, int level, int result);
+void *qemu_irq_get_payload(qemu_irq irq);
+
 /* Returns an array of N IRQs.  */
 qemu_irq *qemu_allocate_irqs(qemu_irq_handler handler, void *opaque, int n);
 void qemu_free_irqs(qemu_irq *s);