Patchwork [v4,5/5] i8259: fix dynamically masking slave IRQs with IMR register

login
register
mail settings
Submitter Matthew Ogilvie
Date Sept. 3, 2012, 2:56 a.m.
Message ID <1346640974-30974-6-git-send-email-mmogilvi_qemu@miniinfo.net>
Download mbox | patch
Permalink /patch/181283/
State New
Headers show

Comments

Matthew Ogilvie - Sept. 3, 2012, 2:56 a.m.
Although I haven't found any specs that say so, on real hardware
I have a test program that shows if you mask off the slave
interrupt (say IRQ14) in the IMR after it has already been raised,
the master (IRQ2) gets canceled (that is, IRQ2 acts like it is level
triggered).  Without this patch, qemu delivers a
spurious interrupt (IRQ15) instead when running the test program.

Signed-off-by: Matthew Ogilvie <mmogilvi_qemu@miniinfo.net>
---

I've written a test program (in the form of a floppy disk boot sector)
that demonstrates that qemu's emulation of IRQ2 propagation from the
slave i8259 to the master does not work correctly when the CPU has
interrupts disabled and it masks off the original interrupt (IRQ14)
in the slave's IMR register.  This was based on simplifying breakage
observed when trying to run an old Microport UNIX system (ca 1987).

Earlier I speculated that maybe the ELCR bit for IRQ2 was incorrect,
but now I don't think changing the bit (from the target's
perspective) would be a good idea.  See below.

You can download the source code for the test program from
http://home.comcast.net/~mmogilvi/downloads/i8259-imr-test-2012-09-02.tar.bz2
It can be compiled using a standard GNU i386 toolchain on Linux.

The heart of the test program is:

        cli

          # i8259:imm: mask off everything except IRQ2
        movb $0xfb,%al     # master (only IRQ2 clear)
        outb %al,$0x21
        movb $0xff,%al     # slave
        outb %al,$0xa1

        mov $.msgCmdRead,%ax
        call print
        call initIrqHandlers
        call scheduleIrq14

        call .largeDelay   # note: IRQ14 raised while this is waiting

        mov $.msgUnmask,%ax
        call print
        movb $0x3f,%al     # unmask IRQ14 and IRQ15
        outb %al,$0xa1

        call .largeDelay   # (probably not important)

        mov $.msgMask,%ax
        call print
        movb $0xff,%al     # mask IRQ14 and IRQ15 again
        outb %al,$0xa1

        call .largeDelay   # (probably not important)

        mov $.msgSti,%ax
        call print
        sti

        call .largeDelay   # note: spurious interrupt (IRQ15) from qemu

        cli
        mov $.msgUnmask,%ax
        call print
        movb $0x3f,%al     # unmask IRQ14 and IRQ15
        outb %al,$0xa1
        sti

        call .largeDelay   # we should finally see IRQ14 here?

          # DONE:
        cli
        movw $.msgDone, %ax
        call print
  .Lhalt:
        hlt
        jmp .Lhalt

In qemu, the master treats IRQ2 as edge triggered, and delivers a
spurious interrupt if the CPU re-enables interrupts after
the slave's IMR is masked off (it also doesn't seem to deliver
the real interrupt when the IMR is unmasked, but maybe that is
because I'm not correctly acknowledging the spurious interrupt).

  - Qemu output (without this patch):
        elcr=0c00 cmdRead ummask mask sti irq15 unmask DONE

But on real hardware, the master seems to treat IRQ2 as level triggered,
and doesn't deliver an interrupt to the CPU until the slave unmasks IRQ14.
I've tried this on several machines, including a non-PCI 486 that
does not seem to have ELCR registers.

  - Typical output (pentium/pentium pro/dual AMD Athlon/etc; slight
    variations in some elcr bits depending on machine, but bit 0x04
    is always clear):
        elcr=0c20 cmdRead unmask mask sti unmask irq14 DONE

  - 486 without elcr (just an ISA bus):
        elcr=e0e0 cmdRead unmask mask sti unmask irq14 DONE

  - One failure: It doesn't boot properly (no output) with a USB floppy
    drive on my Intel Core I7.  Guess: The test program just barely
    fits in a sector, with no room for any tables (partition/etc)
    that BIOS might check for if it isn't an original, native floppy
    drive.

-----------------------

I've found a few descriptions of programming the i8259.
The closest thing I've found to a detailed spec is in
"iAPX 86, 88 User's Manual", dated August 1981:
http://ebookbrowse.com/1981-iapx-86-88-users-manual-pdf-d3089962
It has a significant section titled "Using the 8259A Programmable
Interrupt Controller" starting on page 438 or A-135.

But none of my sources seem to specify how master/slave cascading
interacts with the IMR (mask register) and edge trigger mode,
although it talks about those things individually.
Also, given the date it isn't surprising that it doesn't mention
the elcr registers at all.

I have not found any real specs for the ELCR registers, but I've found a
few hints:

  - Two 8 bit registers: one for master (0x4d0) and one for slave (0x4d1).
  - One bit per IRQ line: 0 for edge trigger, 1 for level trigger.
  - Not present unless the machine has EISA or PCI buses.  Plain
    ISA doesn't have it.
  - Might be implemented completely external to the actual i8259s.
  - As seen in test program output above, ELCR bit 0x04 is clear,
    indicating that IRQ2 is supposedly edge triggered, even though
    actual tested behavior is level triggered.
  - A google search (8259 elcr -linux -qemu) [exclude the
    dominant Linux and qemu related pages] found at least one operating
    system that checks several ELCR bits (including IRQ2) as part
    of a probe to determine if ELCR exists.  So simply setting
    the 0x04 bit is probably a bad idea.  For example, line 119 of:
    https://bitbucket.org/npe/nix/src/35d39df17077/src/9kron2/k8/i8259.c

-----------------------

My guess is that if a master IRQ line is marked as coming from a slave
(in the ICW3 register), then the i8259 treats that line as level
triggered regardless of ELCR registers or the LTIM bit of ICW1 (in
addition to other special slave line logic).  Otherwise, the slave
IMR register would seem rather useless.

Under that theory, something like the following patch would be appropriate
for qemu.  This fixes both the test program, and the original Microport
UNIX problem.

Possible concerns with this patch:

  - KVM still needs to be fixed.  I haven't researched this at all,
    beyond noticing that it has the same broken behavior running the
    test program, and the high level symptoms from Microport UNIX
    are slightly different with KVM.
  - It might also be a good idea to add something like my test
    program to qemu/tests somewhere.  This would be a separate patch.
  - Best icw3 configuration strategy?: Should it be stored, or
    just assume it is correctly configured based on
    PICCommonState::master and standard PC IRQ2 configuration?
    Perhaps add sanity checks for reasonable values when the
    guest is configuring the 8259s?  Did I catch all the places
    that need to deal with a new state variable (assuming we
    decide to store it)?
  - This patch is adding code to what may be a relatively common
    code path (every time interrupts raised/lowered/acknowledged, masks
    changed, etc).  Potentially it could be optimized by adding an
    "effective_elcr" state variable, that is maintined as a combination
    of icw3 and elcr (and maybe add support for LTIM bit from ICW1 at the
    same time), so that we don't need to have extra code in the
    critical path.  Does anyone think this is worth it?  I'm leaning
    towards "no".
Paolo Bonzini - Sept. 3, 2012, 7:08 a.m.
Il 03/09/2012 04:56, Matthew Ogilvie ha scritto:
> Although I haven't found any specs that say so, on real hardware
> I have a test program that shows if you mask off the slave
> interrupt (say IRQ14) in the IMR after it has already been raised,
> the master (IRQ2) gets canceled (that is, IRQ2 acts like it is level
> triggered).  Without this patch, qemu delivers a
> spurious interrupt (IRQ15) instead when running the test program.
> 
> Signed-off-by: Matthew Ogilvie <mmogilvi_qemu@miniinfo.net>

Nice testing, thanks!

KVM's i8259 emulation can be patched at arch/x86/kvm/i8259.c in the
Linux tree.

You can write a test for kvm-unit-tests.  These tests can be written in
C, including interrupt handlers (see lib/x86/isr.h).  The git tree is at
git://git.kernel.org/pub/scm/virt/kvm/kvm-unit-tests.git.

Paolo
Andreas Färber - Sept. 3, 2012, 8:40 a.m.
Am 03.09.2012 04:56, schrieb Matthew Ogilvie:
> diff --git a/hw/i8259_common.c b/hw/i8259_common.c
> index ab3d98b..dcde5f2 100644
> --- a/hw/i8259_common.c
> +++ b/hw/i8259_common.c
[...]
> @@ -111,6 +112,7 @@ static const VMStateDescription vmstate_pic_common = {
>          VMSTATE_UINT8(isr, PICCommonState),
>          VMSTATE_UINT8(priority_add, PICCommonState),
>          VMSTATE_UINT8(irq_base, PICCommonState),
> +        VMSTATE_UINT8(icw3, PICCommonState),
>          VMSTATE_UINT8(read_reg_select, PICCommonState),
>          VMSTATE_UINT8(poll, PICCommonState),
>          VMSTATE_UINT8(special_mask, PICCommonState),

Additional VMState needs to be versioned by incrementing .version_id and
by specifying the new version number here. Otherwise it breaks migration.

Regards,
Andreas
Jan Kiszka - Sept. 3, 2012, 8:51 a.m.
On 2012-09-03 04:56, Matthew Ogilvie wrote:
> Although I haven't found any specs that say so, on real hardware
> I have a test program that shows if you mask off the slave
> interrupt (say IRQ14) in the IMR after it has already been raised,
> the master (IRQ2) gets canceled (that is, IRQ2 acts like it is level
> triggered).  Without this patch, qemu delivers a
> spurious interrupt (IRQ15) instead when running the test program.
> 
> Signed-off-by: Matthew Ogilvie <mmogilvi_qemu@miniinfo.net>
> ---
> 
> I've written a test program (in the form of a floppy disk boot sector)
> that demonstrates that qemu's emulation of IRQ2 propagation from the
> slave i8259 to the master does not work correctly when the CPU has
> interrupts disabled and it masks off the original interrupt (IRQ14)
> in the slave's IMR register.  This was based on simplifying breakage
> observed when trying to run an old Microport UNIX system (ca 1987).
> 
> Earlier I speculated that maybe the ELCR bit for IRQ2 was incorrect,
> but now I don't think changing the bit (from the target's
> perspective) would be a good idea.  See below.

Yes, you cannot set IRQ0..2 to level triggered mode on modern chipsets,
look e.g. at the ICH9 spec:

"In edge mode, (bit[x] = 0), the interrupt is recognized by a low to
high transition. In level mode (bit[x] = 1), the interrupt is recognized
by a high level. The cascade channel, IRQ2, the heart beat timer (IRQ0),
and the keyboard controller (IRQ1), cannot be put into level mode."

So, fiddling with ELCR from guest perspective cannot be the solution.

> 
> You can download the source code for the test program from
> http://home.comcast.net/~mmogilvi/downloads/i8259-imr-test-2012-09-02.tar.bz2
> It can be compiled using a standard GNU i386 toolchain on Linux.

As Paolo said, we are better off with a KVM unit test. That should be
loadable via grub & co. on real HW as well.

> 
> The heart of the test program is:
> 

Initialization is also important to exclude leftovers of the BIOS.

>         cli
> 
>           # i8259:imm: mask off everything except IRQ2
>         movb $0xfb,%al     # master (only IRQ2 clear)
>         outb %al,$0x21
>         movb $0xff,%al     # slave
>         outb %al,$0xa1
> 
>         mov $.msgCmdRead,%ax
>         call print
>         call initIrqHandlers
>         call scheduleIrq14
> 
>         call .largeDelay   # note: IRQ14 raised while this is waiting
> 
>         mov $.msgUnmask,%ax
>         call print
>         movb $0x3f,%al     # unmask IRQ14 and IRQ15
>         outb %al,$0xa1
> 
>         call .largeDelay   # (probably not important)
> 
>         mov $.msgMask,%ax
>         call print
>         movb $0xff,%al     # mask IRQ14 and IRQ15 again
>         outb %al,$0xa1

And now check IRR of the master by reading it back. That would be a
smoking gun if bit 2 is set before and cleared afterward.

> 
>         call .largeDelay   # (probably not important)
> 
>         mov $.msgSti,%ax
>         call print
>         sti
> 
>         call .largeDelay   # note: spurious interrupt (IRQ15) from qemu
> 
>         cli
>         mov $.msgUnmask,%ax
>         call print
>         movb $0x3f,%al     # unmask IRQ14 and IRQ15
>         outb %al,$0xa1
>         sti
> 
>         call .largeDelay   # we should finally see IRQ14 here?
> 
>           # DONE:
>         cli
>         movw $.msgDone, %ax
>         call print
>   .Lhalt:
>         hlt
>         jmp .Lhalt
> 
> In qemu, the master treats IRQ2 as edge triggered, and delivers a
> spurious interrupt if the CPU re-enables interrupts after
> the slave's IMR is masked off (it also doesn't seem to deliver
> the real interrupt when the IMR is unmasked, but maybe that is
> because I'm not correctly acknowledging the spurious interrupt).

Yes, that is required.

> 
>   - Qemu output (without this patch):
>         elcr=0c00 cmdRead ummask mask sti irq15 unmask DONE
> 
> But on real hardware, the master seems to treat IRQ2 as level triggered,
> and doesn't deliver an interrupt to the CPU until the slave unmasks IRQ14.
> I've tried this on several machines, including a non-PCI 486 that
> does not seem to have ELCR registers.
> 
>   - Typical output (pentium/pentium pro/dual AMD Athlon/etc; slight
>     variations in some elcr bits depending on machine, but bit 0x04
>     is always clear):
>         elcr=0c20 cmdRead unmask mask sti unmask irq14 DONE
> 
>   - 486 without elcr (just an ISA bus):
>         elcr=e0e0 cmdRead unmask mask sti unmask irq14 DONE
> 
>   - One failure: It doesn't boot properly (no output) with a USB floppy
>     drive on my Intel Core I7.  Guess: The test program just barely
>     fits in a sector, with no room for any tables (partition/etc)
>     that BIOS might check for if it isn't an original, native floppy
>     drive.
> 
> -----------------------
> 
> I've found a few descriptions of programming the i8259.

I was mostly following the official "8259A PROGRAMMABLE INTERRUPT
CONTROLLER (8259A 8259A-2)" by Intel. But it also doesn't mention any
trigger mode changes for the cascading input lines.

> The closest thing I've found to a detailed spec is in
> "iAPX 86, 88 User's Manual", dated August 1981:
> http://ebookbrowse.com/1981-iapx-86-88-users-manual-pdf-d3089962
> It has a significant section titled "Using the 8259A Programmable
> Interrupt Controller" starting on page 438 or A-135.
> 
> But none of my sources seem to specify how master/slave cascading
> interacts with the IMR (mask register) and edge trigger mode,
> although it talks about those things individually.
> Also, given the date it isn't surprising that it doesn't mention
> the elcr registers at all.
> 
> I have not found any real specs for the ELCR registers, but I've found a
> few hints:

Intel chipset docs describe it, though only briefly.

> 
>   - Two 8 bit registers: one for master (0x4d0) and one for slave (0x4d1).
>   - One bit per IRQ line: 0 for edge trigger, 1 for level trigger.
>   - Not present unless the machine has EISA or PCI buses.  Plain
>     ISA doesn't have it.
>   - Might be implemented completely external to the actual i8259s.
>   - As seen in test program output above, ELCR bit 0x04 is clear,
>     indicating that IRQ2 is supposedly edge triggered, even though
>     actual tested behavior is level triggered.
>   - A google search (8259 elcr -linux -qemu) [exclude the
>     dominant Linux and qemu related pages] found at least one operating
>     system that checks several ELCR bits (including IRQ2) as part
>     of a probe to determine if ELCR exists.  So simply setting
>     the 0x04 bit is probably a bad idea.  For example, line 119 of:
>     https://bitbucket.org/npe/nix/src/35d39df17077/src/9kron2/k8/i8259.c
> 
> -----------------------
> 
> My guess is that if a master IRQ line is marked as coming from a slave
> (in the ICW3 register), then the i8259 treats that line as level
> triggered regardless of ELCR registers or the LTIM bit of ICW1 (in
> addition to other special slave line logic).  Otherwise, the slave
> IMR register would seem rather useless.

That IMR is surely not useless, even if we do not automatically revoke
the IRQ2 event. It would just require more careful unmasking in that case.

> 
> Under that theory, something like the following patch would be appropriate
> for qemu.  This fixes both the test program, and the original Microport
> UNIX problem.
> 
> Possible concerns with this patch:
> 
>   - KVM still needs to be fixed.  I haven't researched this at all,
>     beyond noticing that it has the same broken behavior running the
>     test program, and the high level symptoms from Microport UNIX
>     are slightly different with KVM.
>   - It might also be a good idea to add something like my test
>     program to qemu/tests somewhere.  This would be a separate patch.
>   - Best icw3 configuration strategy?: Should it be stored, or
>     just assume it is correctly configured based on
>     PICCommonState::master and standard PC IRQ2 configuration?

Let's avoid adding more logic that assumes cascading IRQ == 2.

>     Perhaps add sanity checks for reasonable values when the
>     guest is configuring the 8259s?  Did I catch all the places

No need for sanity checks if the logic is generic enough.

>     that need to deal with a new state variable (assuming we
>     decide to store it)?
>   - This patch is adding code to what may be a relatively common
>     code path (every time interrupts raised/lowered/acknowledged, masks
>     changed, etc).  Potentially it could be optimized by adding an
>     "effective_elcr" state variable, that is maintined as a combination
>     of icw3 and elcr (and maybe add support for LTIM bit from ICW1 at the
>     same time), so that we don't need to have extra code in the
>     critical path.  Does anyone think this is worth it?  I'm leaning
>     towards "no".

No need for such a micro-optimization.

> 
> ==========================================
> 
>  hw/i8259.c          | 12 ++++++++----
>  hw/i8259_common.c   |  2 ++
>  hw/i8259_internal.h |  1 +
>  3 files changed, 11 insertions(+), 4 deletions(-)
> 
> diff --git a/hw/i8259.c b/hw/i8259.c
> index 6587666..8e2f9f4 100644
> --- a/hw/i8259.c
> +++ b/hw/i8259.c
> @@ -140,7 +140,7 @@ static void pic_set_irq(void *opaque, int irq, int level)
>      }
>  #endif
>  
> -    if (s->elcr & mask) {
> +    if ((s->elcr | s->icw3) & mask) {
>          /* level triggered */
>          if (level) {
>              s->irr |= mask;
> @@ -174,7 +174,7 @@ static void pic_intack(PICCommonState *s, int irq)
>          s->isr |= (1 << irq);
>      }
>      /* We don't clear a level sensitive interrupt here */
> -    if (!(s->elcr & (1 << irq))) {
> +    if (!((s->elcr | s->icw3) & (1 << irq))) {
>          s->irr &= ~(1 << irq);
>      }
>      pic_update_irq(s);
> @@ -314,6 +314,10 @@ static void pic_ioport_write(void *opaque, target_phys_addr_t addr64,
>              s->init_state = s->single_mode ? (s->init4 ? 3 : 0) : 2;
>              break;
>          case 2:
> +            s->icw3 = val;
> +            /* TODO: Enforce constraints?: Master is typically
> +             *   0x04 for IRQ2 (maybe 0 is also OK).  Slave must be 0.
> +             */

No constraints, see above (and slave must be 2, its ID == cascading IRQ#).

>              if (s->init4) {
>                  s->init_state = 3;
>              } else {
> @@ -419,9 +423,9 @@ void pic_info(Monitor *mon)
>      for (i = 0; i < 2; i++) {
>          s = i == 0 ? DO_UPCAST(PICCommonState, dev.qdev, isa_pic) : slave_pic;
>          monitor_printf(mon, "pic%d: irr=%02x imr=%02x isr=%02x hprio=%d "
> -                       "irq_base=%02x rr_sel=%d elcr=%02x fnm=%d\n",
> +                       "irq_base=%02x icw3=%02x rr_sel=%d elcr=%02x fnm=%d\n",
>                         i, s->irr, s->imr, s->isr, s->priority_add,
> -                       s->irq_base, s->read_reg_select, s->elcr,
> +                       s->irq_base, s->icw3, s->read_reg_select, s->elcr,
>                         s->special_fully_nested_mode);
>      }
>  }
> diff --git a/hw/i8259_common.c b/hw/i8259_common.c
> index ab3d98b..dcde5f2 100644
> --- a/hw/i8259_common.c
> +++ b/hw/i8259_common.c
> @@ -33,6 +33,7 @@ void pic_reset_common(PICCommonState *s)
>      s->isr = 0;
>      s->priority_add = 0;
>      s->irq_base = 0;
> +    s->icw3 = 0;
>      s->read_reg_select = 0;
>      s->poll = 0;
>      s->special_mask = 0;
> @@ -111,6 +112,7 @@ static const VMStateDescription vmstate_pic_common = {
>          VMSTATE_UINT8(isr, PICCommonState),
>          VMSTATE_UINT8(priority_add, PICCommonState),
>          VMSTATE_UINT8(irq_base, PICCommonState),
> +        VMSTATE_UINT8(icw3, PICCommonState),

Let's add this as an optional subsection, only written when it's not
0x04 (for a master) or 0x2 (for a slave). See target-i386/machine.c for
examples, or read docs/migration.txt. That will mean you need to set
icw3 to 0x4 before loading a vmstate (=> pre_load handler).

The effect is that, generally, no additional state will be written in
practice - unless we save right after reset or some guest messes its PIC
configuration up.

>          VMSTATE_UINT8(read_reg_select, PICCommonState),
>          VMSTATE_UINT8(poll, PICCommonState),
>          VMSTATE_UINT8(special_mask, PICCommonState),
> diff --git a/hw/i8259_internal.h b/hw/i8259_internal.h
> index 4137b61..de67d49 100644
> --- a/hw/i8259_internal.h
> +++ b/hw/i8259_internal.h
> @@ -55,6 +55,7 @@ struct PICCommonState {
>      uint8_t isr; /* interrupt service register */
>      uint8_t priority_add; /* highest irq priority */
>      uint8_t irq_base;
> +    uint8_t icw3; /* bit set if line is connected to a slave */

Comment is wrong. This word is used for masters and slaves.

>      uint8_t read_reg_select;
>      uint8_t poll;
>      uint8_t special_mask;
> 

The only thing that worries me is that we just consider the PC so far
while the i8259 is also used on different architectures (PPC, MIPS, Alpha?).

Jan
Jan Kiszka - Sept. 3, 2012, 8:53 a.m.
On 2012-09-03 10:51, Jan Kiszka wrote:
>> diff --git a/hw/i8259_common.c b/hw/i8259_common.c
>> index ab3d98b..dcde5f2 100644
>> --- a/hw/i8259_common.c
>> +++ b/hw/i8259_common.c
>> @@ -33,6 +33,7 @@ void pic_reset_common(PICCommonState *s)
>>      s->isr = 0;
>>      s->priority_add = 0;
>>      s->irq_base = 0;
>> +    s->icw3 = 0;
>>      s->read_reg_select = 0;
>>      s->poll = 0;
>>      s->special_mask = 0;
>> @@ -111,6 +112,7 @@ static const VMStateDescription vmstate_pic_common = {
>>          VMSTATE_UINT8(isr, PICCommonState),
>>          VMSTATE_UINT8(priority_add, PICCommonState),
>>          VMSTATE_UINT8(irq_base, PICCommonState),
>> +        VMSTATE_UINT8(icw3, PICCommonState),
> 
> Let's add this as an optional subsection, only written when it's not
> 0x04 (for a master) or 0x2 (for a slave). See target-i386/machine.c for
> examples, or read docs/migration.txt. That will mean you need to set
> icw3 to 0x4 before loading a vmstate (=> pre_load handler).

Oh, and this code affects also the kvm-pic. So you have to maintain icw3
for that model as well, setting it to the PC defaults. Please test KVM
after your changes, including migration.

Jan
Paolo Bonzini - Sept. 3, 2012, 9:34 a.m.
Il 03/09/2012 10:51, Jan Kiszka ha scritto:
> The only thing that worries me is that we just consider the PC so far
> while the i8259 is also used on different architectures (PPC, MIPS, Alpha?).

Why is this a problem?  All of them use IRQ2 for a cascade, and initialize
icw3 to 0x4/0x2 (I checked OpenBIOS, rth's palcode for Alpha, and Linux).

BTW, from the palcode it looks like Alpha wants LTIM=1, so it would be nice
to implement that one as well:

  /* ??? MILO initializes the PIC as edge triggered; I do not know how SRM
     initializes them.  However, Linux seems to expect that these are level
     triggered.  That may be a kernel bug, but level triggers are more
     reliable anyway so lets go with that.  */

  /* Initialize level triggers.  The CY82C693UB that's on real alpha
     hardware doesn't have this; this is a PIIX extension.  However,
     QEMU doesn't implement regular level triggers.  */
  outb(0xff, PORT_PIC2_ELCR);
  outb(0xff, PORT_PIC1_ELCR);

Paolo
Jan Kiszka - Sept. 3, 2012, 10:34 a.m.
On 2012-09-03 11:34, Paolo Bonzini wrote:
> Il 03/09/2012 10:51, Jan Kiszka ha scritto:
>> The only thing that worries me is that we just consider the PC so far
>> while the i8259 is also used on different architectures (PPC, MIPS, Alpha?).
> 
> Why is this a problem?  All of them use IRQ2 for a cascade, and initialize
> icw3 to 0x4/0x2 (I checked OpenBIOS, rth's palcode for Alpha, and Linux).

IRQ2 is already hard-coded in QEMU (we had to patch this for our machine
emulation, but less in recent versions), that is not the point. I'm
concerned about the behavioral changes we are discussing here, ie. the
special handling of cascading interrupt inputs.

> 
> BTW, from the palcode it looks like Alpha wants LTIM=1, so it would be nice
> to implement that one as well:
> 
>   /* ??? MILO initializes the PIC as edge triggered; I do not know how SRM
>      initializes them.  However, Linux seems to expect that these are level
>      triggered.  That may be a kernel bug, but level triggers are more
>      reliable anyway so lets go with that.  */
> 
>   /* Initialize level triggers.  The CY82C693UB that's on real alpha
>      hardware doesn't have this; this is a PIIX extension.  However,
>      QEMU doesn't implement regular level triggers.  */
>   outb(0xff, PORT_PIC2_ELCR);
>   outb(0xff, PORT_PIC1_ELCR);

Just takes someone to write the patch, as usual. :)

Jan
Paolo Bonzini - Sept. 3, 2012, 11:11 a.m.
Il 03/09/2012 12:34, Jan Kiszka ha scritto:
>> > Why is this a problem?  All of them use IRQ2 for a cascade, and initialize
>> > icw3 to 0x4/0x2 (I checked OpenBIOS, rth's palcode for Alpha, and Linux).
> IRQ2 is already hard-coded in QEMU (we had to patch this for our machine
> emulation, but less in recent versions), that is not the point. I'm
> concerned about the behavioral changes we are discussing here, ie. the
> special handling of cascading interrupt inputs.

Yeah, it's quite interesting that the behavior isn't mentioned in the
8259 datasheets.  Still in retrospect it's hard to see how it can
possibly work with edge-triggered cascaded inputs.

Paolo
Jan Kiszka - Sept. 3, 2012, 11:26 a.m.
On 2012-09-03 13:11, Paolo Bonzini wrote:
> Il 03/09/2012 12:34, Jan Kiszka ha scritto:
>>>> Why is this a problem?  All of them use IRQ2 for a cascade, and initialize
>>>> icw3 to 0x4/0x2 (I checked OpenBIOS, rth's palcode for Alpha, and Linux).
>> IRQ2 is already hard-coded in QEMU (we had to patch this for our machine
>> emulation, but less in recent versions), that is not the point. I'm
>> concerned about the behavioral changes we are discussing here, ie. the
>> special handling of cascading interrupt inputs.
> 
> Yeah, it's quite interesting that the behavior isn't mentioned in the
> 8259 datasheets.  Still in retrospect it's hard to see how it can
> possibly work with edge-triggered cascaded inputs.

As I said: by avoiding the pattern Matthew generated in his test case.
That's not impossible. All the other OSes running fine against out PIC
models are proving this.

Jan
Avi Kivity - Sept. 3, 2012, 2:39 p.m.
On 09/03/2012 11:40 AM, Andreas Färber wrote:
> Am 03.09.2012 04:56, schrieb Matthew Ogilvie:
>> diff --git a/hw/i8259_common.c b/hw/i8259_common.c
>> index ab3d98b..dcde5f2 100644
>> --- a/hw/i8259_common.c
>> +++ b/hw/i8259_common.c
> [...]
>> @@ -111,6 +112,7 @@ static const VMStateDescription vmstate_pic_common = {
>>          VMSTATE_UINT8(isr, PICCommonState),
>>          VMSTATE_UINT8(priority_add, PICCommonState),
>>          VMSTATE_UINT8(irq_base, PICCommonState),
>> +        VMSTATE_UINT8(icw3, PICCommonState),
>>          VMSTATE_UINT8(read_reg_select, PICCommonState),
>>          VMSTATE_UINT8(poll, PICCommonState),
>>          VMSTATE_UINT8(special_mask, PICCommonState),
> 
> Additional VMState needs to be versioned by incrementing .version_id and
> by specifying the new version number here. Otherwise it breaks migration.

And incrementing the version ID breaks backwards migration.  The correct
solution is subsections, copying Juan and booking a trip to the Mariana
trench.
Juan Quintela - Sept. 3, 2012, 3:42 p.m.
Avi Kivity <avi@redhat.com> wrote:
> On 09/03/2012 11:40 AM, Andreas Färber wrote:
>> Am 03.09.2012 04:56, schrieb Matthew Ogilvie:
>>> diff --git a/hw/i8259_common.c b/hw/i8259_common.c
>>> index ab3d98b..dcde5f2 100644
>>> --- a/hw/i8259_common.c
>>> +++ b/hw/i8259_common.c
>> [...]
>>> @@ -111,6 +112,7 @@ static const VMStateDescription vmstate_pic_common = {
>>>          VMSTATE_UINT8(isr, PICCommonState),
>>>          VMSTATE_UINT8(priority_add, PICCommonState),
>>>          VMSTATE_UINT8(irq_base, PICCommonState),
>>> +        VMSTATE_UINT8(icw3, PICCommonState),
>>>          VMSTATE_UINT8(read_reg_select, PICCommonState),
>>>          VMSTATE_UINT8(poll, PICCommonState),
>>>          VMSTATE_UINT8(special_mask, PICCommonState),
>> 
>> Additional VMState needs to be versioned by incrementing .version_id and
>> by specifying the new version number here. Otherwise it breaks migration.

For the subsection, only sending this when icw3 != 0 is enough?  I am
searching for a test about when we need to send it.

> And incrementing the version ID breaks backwards migration.  The correct
> solution is subsections, copying Juan and booking a trip to the Mariana
> trench.

Book also for me, no need for the return ticket.

Later, Juan.
Jan Kiszka - Sept. 3, 2012, 3:45 p.m.
On 2012-09-03 17:42, Juan Quintela wrote:
> Avi Kivity <avi@redhat.com> wrote:
>> On 09/03/2012 11:40 AM, Andreas Färber wrote:
>>> Am 03.09.2012 04:56, schrieb Matthew Ogilvie:
>>>> diff --git a/hw/i8259_common.c b/hw/i8259_common.c
>>>> index ab3d98b..dcde5f2 100644
>>>> --- a/hw/i8259_common.c
>>>> +++ b/hw/i8259_common.c
>>> [...]
>>>> @@ -111,6 +112,7 @@ static const VMStateDescription vmstate_pic_common = {
>>>>          VMSTATE_UINT8(isr, PICCommonState),
>>>>          VMSTATE_UINT8(priority_add, PICCommonState),
>>>>          VMSTATE_UINT8(irq_base, PICCommonState),
>>>> +        VMSTATE_UINT8(icw3, PICCommonState),
>>>>          VMSTATE_UINT8(read_reg_select, PICCommonState),
>>>>          VMSTATE_UINT8(poll, PICCommonState),
>>>>          VMSTATE_UINT8(special_mask, PICCommonState),
>>>
>>> Additional VMState needs to be versioned by incrementing .version_id and
>>> by specifying the new version number here. Otherwise it breaks migration.
> 
> For the subsection, only sending this when icw3 != 0 is enough?  I am
> searching for a test about when we need to send it.

See my reply on this topic in the other branch of this thread.

> 
>> And incrementing the version ID breaks backwards migration.  The correct
>> solution is subsections, copying Juan and booking a trip to the Mariana
>> trench.
> 
> Book also for me, no need for the return ticket.
> 

Hey, this scenario is rather harmless (famous last words...). ;)

Jan
Avi Kivity - Sept. 3, 2012, 3:52 p.m.
On 09/03/2012 06:42 PM, Juan Quintela wrote:
> Avi Kivity <avi@redhat.com> wrote:
>> On 09/03/2012 11:40 AM, Andreas Färber wrote:
>>> Am 03.09.2012 04:56, schrieb Matthew Ogilvie:
>>>> diff --git a/hw/i8259_common.c b/hw/i8259_common.c
>>>> index ab3d98b..dcde5f2 100644
>>>> --- a/hw/i8259_common.c
>>>> +++ b/hw/i8259_common.c
>>> [...]
>>>> @@ -111,6 +112,7 @@ static const VMStateDescription vmstate_pic_common = {
>>>>          VMSTATE_UINT8(isr, PICCommonState),
>>>>          VMSTATE_UINT8(priority_add, PICCommonState),
>>>>          VMSTATE_UINT8(irq_base, PICCommonState),
>>>> +        VMSTATE_UINT8(icw3, PICCommonState),
>>>>          VMSTATE_UINT8(read_reg_select, PICCommonState),
>>>>          VMSTATE_UINT8(poll, PICCommonState),
>>>>          VMSTATE_UINT8(special_mask, PICCommonState),
>>> 
>>> Additional VMState needs to be versioned by incrementing .version_id and
>>> by specifying the new version number here. Otherwise it breaks migration.
> 
> For the subsection, only sending this when icw3 != 0 is enough?  I am
> searching for a test about when we need to send it.

Looks like the optimal condition is ((s->icw3 & ~s->eclr) != 0) (i.e.
bit set in icw3 but clear in eclr).
Jan Kiszka - Sept. 3, 2012, 3:54 p.m.
On 2012-09-03 17:52, Avi Kivity wrote:
> On 09/03/2012 06:42 PM, Juan Quintela wrote:
>> Avi Kivity <avi@redhat.com> wrote:
>>> On 09/03/2012 11:40 AM, Andreas Färber wrote:
>>>> Am 03.09.2012 04:56, schrieb Matthew Ogilvie:
>>>>> diff --git a/hw/i8259_common.c b/hw/i8259_common.c
>>>>> index ab3d98b..dcde5f2 100644
>>>>> --- a/hw/i8259_common.c
>>>>> +++ b/hw/i8259_common.c
>>>> [...]
>>>>> @@ -111,6 +112,7 @@ static const VMStateDescription vmstate_pic_common = {
>>>>>          VMSTATE_UINT8(isr, PICCommonState),
>>>>>          VMSTATE_UINT8(priority_add, PICCommonState),
>>>>>          VMSTATE_UINT8(irq_base, PICCommonState),
>>>>> +        VMSTATE_UINT8(icw3, PICCommonState),
>>>>>          VMSTATE_UINT8(read_reg_select, PICCommonState),
>>>>>          VMSTATE_UINT8(poll, PICCommonState),
>>>>>          VMSTATE_UINT8(special_mask, PICCommonState),
>>>>
>>>> Additional VMState needs to be versioned by incrementing .version_id and
>>>> by specifying the new version number here. Otherwise it breaks migration.
>>
>> For the subsection, only sending this when icw3 != 0 is enough?  I am
>> searching for a test about when we need to send it.
> 
> Looks like the optimal condition is ((s->icw3 & ~s->eclr) != 0) (i.e.
> bit set in icw3 but clear in eclr).

The standard PC values are optimal: 4 for master, 2 for slave.

Jan
Avi Kivity - Sept. 3, 2012, 3:57 p.m.
On 09/03/2012 06:54 PM, Jan Kiszka wrote:
> On 2012-09-03 17:52, Avi Kivity wrote:
>> On 09/03/2012 06:42 PM, Juan Quintela wrote:
>>> Avi Kivity <avi@redhat.com> wrote:
>>>> On 09/03/2012 11:40 AM, Andreas Färber wrote:
>>>>> Am 03.09.2012 04:56, schrieb Matthew Ogilvie:
>>>>>> diff --git a/hw/i8259_common.c b/hw/i8259_common.c
>>>>>> index ab3d98b..dcde5f2 100644
>>>>>> --- a/hw/i8259_common.c
>>>>>> +++ b/hw/i8259_common.c
>>>>> [...]
>>>>>> @@ -111,6 +112,7 @@ static const VMStateDescription vmstate_pic_common = {
>>>>>>          VMSTATE_UINT8(isr, PICCommonState),
>>>>>>          VMSTATE_UINT8(priority_add, PICCommonState),
>>>>>>          VMSTATE_UINT8(irq_base, PICCommonState),
>>>>>> +        VMSTATE_UINT8(icw3, PICCommonState),
>>>>>>          VMSTATE_UINT8(read_reg_select, PICCommonState),
>>>>>>          VMSTATE_UINT8(poll, PICCommonState),
>>>>>>          VMSTATE_UINT8(special_mask, PICCommonState),
>>>>>
>>>>> Additional VMState needs to be versioned by incrementing .version_id and
>>>>> by specifying the new version number here. Otherwise it breaks migration.
>>>
>>> For the subsection, only sending this when icw3 != 0 is enough?  I am
>>> searching for a test about when we need to send it.
>> 
>> Looks like the optimal condition is ((s->icw3 & ~s->eclr) != 0) (i.e.
>> bit set in icw3 but clear in eclr).
> 
> The standard PC values are optimal: 4 for master, 2 for slave.

Can you explain why?  I saw that icw3 is always ORed with eclr, so my
condition will catch exactly those cases where a change in behaviour
occurs, and no more.
Jan Kiszka - Sept. 3, 2012, 4:02 p.m.
On 2012-09-03 17:57, Avi Kivity wrote:
> On 09/03/2012 06:54 PM, Jan Kiszka wrote:
>> On 2012-09-03 17:52, Avi Kivity wrote:
>>> On 09/03/2012 06:42 PM, Juan Quintela wrote:
>>>> Avi Kivity <avi@redhat.com> wrote:
>>>>> On 09/03/2012 11:40 AM, Andreas Färber wrote:
>>>>>> Am 03.09.2012 04:56, schrieb Matthew Ogilvie:
>>>>>>> diff --git a/hw/i8259_common.c b/hw/i8259_common.c
>>>>>>> index ab3d98b..dcde5f2 100644
>>>>>>> --- a/hw/i8259_common.c
>>>>>>> +++ b/hw/i8259_common.c
>>>>>> [...]
>>>>>>> @@ -111,6 +112,7 @@ static const VMStateDescription vmstate_pic_common = {
>>>>>>>          VMSTATE_UINT8(isr, PICCommonState),
>>>>>>>          VMSTATE_UINT8(priority_add, PICCommonState),
>>>>>>>          VMSTATE_UINT8(irq_base, PICCommonState),
>>>>>>> +        VMSTATE_UINT8(icw3, PICCommonState),
>>>>>>>          VMSTATE_UINT8(read_reg_select, PICCommonState),
>>>>>>>          VMSTATE_UINT8(poll, PICCommonState),
>>>>>>>          VMSTATE_UINT8(special_mask, PICCommonState),
>>>>>>
>>>>>> Additional VMState needs to be versioned by incrementing .version_id and
>>>>>> by specifying the new version number here. Otherwise it breaks migration.
>>>>
>>>> For the subsection, only sending this when icw3 != 0 is enough?  I am
>>>> searching for a test about when we need to send it.
>>>
>>> Looks like the optimal condition is ((s->icw3 & ~s->eclr) != 0) (i.e.
>>> bit set in icw3 but clear in eclr).
>>
>> The standard PC values are optimal: 4 for master, 2 for slave.
> 
> Can you explain why?  I saw that icw3 is always ORed with eclr, so my
> condition will catch exactly those cases where a change in behaviour
> occurs, and no more.

The values above are what every user of the PIC cascaded on our targets
must program to use them. So We will find them in the state once any
relevant guest code was able to run (e.g. the BIOS).

Jan
Avi Kivity - Sept. 3, 2012, 4:15 p.m.
On 09/03/2012 07:02 PM, Jan Kiszka wrote:

>>>> Looks like the optimal condition is ((s->icw3 & ~s->eclr) != 0) (i.e.
>>>> bit set in icw3 but clear in eclr).
>>>
>>> The standard PC values are optimal: 4 for master, 2 for slave.
>> 
>> Can you explain why?  I saw that icw3 is always ORed with eclr, so my
>> condition will catch exactly those cases where a change in behaviour
>> occurs, and no more.
> 
> The values above are what every user of the PIC cascaded on our targets
> must program to use them. So We will find them in the state once any
> relevant guest code was able to run (e.g. the BIOS).
> 

Suppose the bios has not run yet?
Paolo Bonzini - Sept. 3, 2012, 4:23 p.m.
Il 03/09/2012 18:15, Avi Kivity ha scritto:
>> > The values above are what every user of the PIC cascaded on our targets
>> > must program to use them. So We will find them in the state once any
>> > relevant guest code was able to run (e.g. the BIOS).
>> > 
> Suppose the bios has not run yet?

Then you transmit the subsection.

BTW this also means that simply checking against eclr|icw3 is wrong; the
right test is:

* against elcr if !s->master

* against elcr|icw3 if s->master

This makes precomputing the value more appealing.

Similarly, perhaps this:

    if (s->special_fully_nested_mode && s->master) {
        mask &= ~(1 << 2);
    }

should be changed to

    if (s->special_fully_nested_mode && s->master) {
        mask &= ~s->icw3;
    }

?

Paolo
Jan Kiszka - Sept. 3, 2012, 4:30 p.m.
On 2012-09-03 18:15, Avi Kivity wrote:
> On 09/03/2012 07:02 PM, Jan Kiszka wrote:
> 
>>>>> Looks like the optimal condition is ((s->icw3 & ~s->eclr) != 0) (i.e.
>>>>> bit set in icw3 but clear in eclr).
>>>>
>>>> The standard PC values are optimal: 4 for master, 2 for slave.
>>>
>>> Can you explain why?  I saw that icw3 is always ORed with eclr, so my
>>> condition will catch exactly those cases where a change in behaviour
>>> occurs, and no more.
>>
>> The values above are what every user of the PIC cascaded on our targets
>> must program to use them. So We will find them in the state once any
>> relevant guest code was able to run (e.g. the BIOS).
>>
> 
> Suppose the bios has not run yet?

Then we must save the 0.

Your logic will force us to save in all standard cases (ELCR's bit for
IRQ2 must not be set by a guest). So it's not really helpful.

Jan
Avi Kivity - Sept. 3, 2012, 4:30 p.m.
On 09/03/2012 07:23 PM, Paolo Bonzini wrote:
> Il 03/09/2012 18:15, Avi Kivity ha scritto:
>>> > The values above are what every user of the PIC cascaded on our targets
>>> > must program to use them. So We will find them in the state once any
>>> > relevant guest code was able to run (e.g. the BIOS).
>>> > 
>> Suppose the bios has not run yet?
> 
> Then you transmit the subsection.

And the migration fails.  Needlessly, since icw3 == 0 doesn't affect
guest operation.
Paolo Bonzini - Sept. 3, 2012, 4:33 p.m.
Il 03/09/2012 18:30, Avi Kivity ha scritto:
>>>>> The values above are what every user of the PIC cascaded on our targets
>>>>> >>> > must program to use them. So We will find them in the state once any
>>>>> >>> > relevant guest code was able to run (e.g. the BIOS).
>>>>> >>> > 
>>> >> Suppose the bios has not run yet?
>> > 
>> > Then you transmit the subsection.
> And the migration fails.  Needlessly, since icw3 == 0 doesn't affect
> guest operation.

But the point of subsections is to succeed migration in the common case,
assuming there is more than one case that doesn't affect guest operation.

Paolo
Jan Kiszka - Sept. 3, 2012, 4:40 p.m.
On 2012-09-03 18:33, Paolo Bonzini wrote:
> Il 03/09/2012 18:30, Avi Kivity ha scritto:
>>>>>> The values above are what every user of the PIC cascaded on our targets
>>>>>>>>>> must program to use them. So We will find them in the state once any
>>>>>>>>>> relevant guest code was able to run (e.g. the BIOS).
>>>>>>>>>>
>>>>>> Suppose the bios has not run yet?
>>>>
>>>> Then you transmit the subsection.
>> And the migration fails.  Needlessly, since icw3 == 0 doesn't affect
>> guest operation.
> 
> But the point of subsections is to succeed migration in the common case,
> assuming there is more than one case that doesn't affect guest operation.

The point is that the common case is icw3 = 4 (for the master), not 0.
And if we don't save that case, we must save the rest. If we don't save
this as well, we lose state information. This can't work.

Jan
Paolo Bonzini - Sept. 3, 2012, 4:56 p.m.
Il 03/09/2012 18:40, Jan Kiszka ha scritto:
>>> >> And the migration fails.  Needlessly, since icw3 == 0 doesn't affect
>>> >> guest operation.
>> > 
>> > But the point of subsections is to succeed migration in the common case,
>> > assuming there is more than one case that doesn't affect guest operation.
> The point is that the common case is icw3 = 4 (for the master), not 0.
> And if we don't save that case, we must save the rest. If we don't save
> this as well, we lose state information. This can't work.

I was agreeing with you, not Avi. :)

Paolo
Avi Kivity - Sept. 4, 2012, 8:16 a.m.
On 09/03/2012 07:33 PM, Paolo Bonzini wrote:
> Il 03/09/2012 18:30, Avi Kivity ha scritto:
>>>>>> The values above are what every user of the PIC cascaded on our targets
>>>>>> >>> > must program to use them. So We will find them in the state once any
>>>>>> >>> > relevant guest code was able to run (e.g. the BIOS).
>>>>>> >>> > 
>>>> >> Suppose the bios has not run yet?
>>> > 
>>> > Then you transmit the subsection.
>> And the migration fails.  Needlessly, since icw3 == 0 doesn't affect
>> guest operation.
> 
> But the point of subsections is to succeed migration in the common case,
> assuming there is more than one case that doesn't affect guest operation.

According to the patch, if icw3 == 4 && !(eclr & 4), then behaviour will
change.  With the standard configuration, if two pci interrupts hit at
once, then before the patch irr.2 will be clear, and afterwards set.

So we do have a behavioural change.  Is the rest of the code masking
this change under the standard configuration?
Paolo Bonzini - Sept. 4, 2012, 9:15 a.m.
Il 04/09/2012 10:16, Avi Kivity ha scritto:
>> > But the point of subsections is to succeed migration in the common case,
>> > assuming there is more than one case that doesn't affect guest operation.
> According to the patch, if icw3 == 4 && !(eclr & 4), then behaviour will
> change.  With the standard configuration, if two pci interrupts hit at
> once, then before the patch irr.2 will be clear, and afterwards set.
> 
> So we do have a behavioural change.  Is the rest of the code masking
> this change under the standard configuration?

No, it is not masking the change.  The assumption is that nothing should
care about irr.2 or isr.2, because nothing attaches an handler to the
cascade interrupt.

You have to choose between assuming this, and breaking backwards
migration.  I would rather break backwards migration, but others disagree...

Paolo
Avi Kivity - Sept. 4, 2012, 9:20 a.m.
On 09/04/2012 12:15 PM, Paolo Bonzini wrote:
> Il 04/09/2012 10:16, Avi Kivity ha scritto:
>>> > But the point of subsections is to succeed migration in the common case,
>>> > assuming there is more than one case that doesn't affect guest operation.
>> According to the patch, if icw3 == 4 && !(eclr & 4), then behaviour will
>> change.  With the standard configuration, if two pci interrupts hit at
>> once, then before the patch irr.2 will be clear, and afterwards set.
>> 
>> So we do have a behavioural change.  Is the rest of the code masking
>> this change under the standard configuration?
> 
> No, it is not masking the change.  The assumption is that nothing should
> care about irr.2 or isr.2, because nothing attaches an handler to the
> cascade interrupt.

Won't the next call to pic_get_irq() notice the difference in s->irr?

> You have to choose between assuming this, and breaking backwards
> migration.  I would rather break backwards migration, but others disagree...

Normally I'd agree, but if the only known breakee is a 1987 guest then
I'd make an exception.
BALATON Zoltan - Sept. 4, 2012, 9:29 a.m.
On Tue, 4 Sep 2012, Avi Kivity wrote:
> On 09/04/2012 12:15 PM, Paolo Bonzini wrote:
>> Il 04/09/2012 10:16, Avi Kivity ha scritto:
>>>>> But the point of subsections is to succeed migration in the common case,
>>>>> assuming there is more than one case that doesn't affect guest operation.
>>> According to the patch, if icw3 == 4 && !(eclr & 4), then behaviour will
>>> change.  With the standard configuration, if two pci interrupts hit at
>>> once, then before the patch irr.2 will be clear, and afterwards set.
>>>
>>> So we do have a behavioural change.  Is the rest of the code masking
>>> this change under the standard configuration?
>>
>> No, it is not masking the change.  The assumption is that nothing should
>> care about irr.2 or isr.2, because nothing attaches an handler to the
>> cascade interrupt.
>
> Won't the next call to pic_get_irq() notice the difference in s->irr?
>
>> You have to choose between assuming this, and breaking backwards
>> migration.  I would rather break backwards migration, but others disagree...
>
> Normally I'd agree, but if the only known breakee is a 1987 guest then
> I'd make an exception.

Another one affected by this is OpenStep 4.2 (probably NeXTstep and 
Rhapsody too) which are not exactly recent either but there are more than 
only one "breakee".

Regards,
BALATON Zoltan
Avi Kivity - Sept. 4, 2012, 9:37 a.m.
On 09/04/2012 12:29 PM, BALATON Zoltan wrote:
> On Tue, 4 Sep 2012, Avi Kivity wrote:
>> On 09/04/2012 12:15 PM, Paolo Bonzini wrote:
>>> Il 04/09/2012 10:16, Avi Kivity ha scritto:
>>>>>> But the point of subsections is to succeed migration in the common
>>>>>> case,
>>>>>> assuming there is more than one case that doesn't affect guest
>>>>>> operation.
>>>> According to the patch, if icw3 == 4 && !(eclr & 4), then behaviour
>>>> will
>>>> change.  With the standard configuration, if two pci interrupts hit at
>>>> once, then before the patch irr.2 will be clear, and afterwards set.
>>>>
>>>> So we do have a behavioural change.  Is the rest of the code masking
>>>> this change under the standard configuration?
>>>
>>> No, it is not masking the change.  The assumption is that nothing should
>>> care about irr.2 or isr.2, because nothing attaches an handler to the
>>> cascade interrupt.
>>
>> Won't the next call to pic_get_irq() notice the difference in s->irr?
>>
>>> You have to choose between assuming this, and breaking backwards
>>> migration.  I would rather break backwards migration, but others
>>> disagree...
>>
>> Normally I'd agree, but if the only known breakee is a 1987 guest then
>> I'd make an exception.
> 
> Another one affected by this is OpenStep 4.2 (probably NeXTstep and
> Rhapsody too) which are not exactly recent either but there are more
> than only one "breakee".

Those are all filed under "esoterics".

I don't mean to say we shouldn't care about them, but there are likely
to be a lot more users doing backwards migration than users running
those guests, let alone migrating them (forwards or backwards).  The
pragmatic choice is clear.
Jan Kiszka - Sept. 4, 2012, 9:51 a.m.
On 2012-09-04 11:37, Avi Kivity wrote:
> On 09/04/2012 12:29 PM, BALATON Zoltan wrote:
>> On Tue, 4 Sep 2012, Avi Kivity wrote:
>>> On 09/04/2012 12:15 PM, Paolo Bonzini wrote:
>>>> Il 04/09/2012 10:16, Avi Kivity ha scritto:
>>>>>>> But the point of subsections is to succeed migration in the common
>>>>>>> case,
>>>>>>> assuming there is more than one case that doesn't affect guest
>>>>>>> operation.
>>>>> According to the patch, if icw3 == 4 && !(eclr & 4), then behaviour
>>>>> will
>>>>> change.  With the standard configuration, if two pci interrupts hit at
>>>>> once, then before the patch irr.2 will be clear, and afterwards set.
>>>>>
>>>>> So we do have a behavioural change.  Is the rest of the code masking
>>>>> this change under the standard configuration?
>>>>
>>>> No, it is not masking the change.  The assumption is that nothing should
>>>> care about irr.2 or isr.2, because nothing attaches an handler to the
>>>> cascade interrupt.
>>>
>>> Won't the next call to pic_get_irq() notice the difference in s->irr?
>>>
>>>> You have to choose between assuming this, and breaking backwards
>>>> migration.  I would rather break backwards migration, but others
>>>> disagree...
>>>
>>> Normally I'd agree, but if the only known breakee is a 1987 guest then
>>> I'd make an exception.
>>
>> Another one affected by this is OpenStep 4.2 (probably NeXTstep and
>> Rhapsody too) which are not exactly recent either but there are more
>> than only one "breakee".
> 
> Those are all filed under "esoterics".
> 
> I don't mean to say we shouldn't care about them, but there are likely
> to be a lot more users doing backwards migration than users running
> those guests, let alone migrating them (forwards or backwards).  The
> pragmatic choice is clear.

BTW, did anyone actually test backward migration recently? I thought to
remember I effectively broke it in 1.1 with some changes to the i8259
(or was it the PIT?) vmstate, and no one really cared about this or my
first proposals to fix it.

Jan
Paolo Bonzini - Sept. 4, 2012, 10:06 a.m.
Il 04/09/2012 11:51, Jan Kiszka ha scritto:
>> > 
>> > I don't mean to say we shouldn't care about them, but there are likely
>> > to be a lot more users doing backwards migration than users running
>> > those guests, let alone migrating them (forwards or backwards).  The
>> > pragmatic choice is clear.
> BTW, did anyone actually test backward migration recently? I thought to
> remember I effectively broke it in 1.1 with some changes to the i8259
> (or was it the PIT?) vmstate, and no one really cared about this or my
> first proposals to fix it.

Correct: commit ce967e2 (i8254: Rework & fix interaction with HPET in
legacy mode, 2012-02-01) bumped the PIT version from 2 to 3.  RTC
changes will break it more in 1.3.

Honestly, backwards migration only works on "enterprise" qemu-kvm
because it is tested only there.  And so far the only sample across
major releases is that RHEL6->RHEL5 migration didn't work.

Here the choice is between changing guest behavior by defaulting to 4/2,
and always transmitting the subsection by defaulting to 0/0.  The latter
makes the subsection useless, so at that point we might as well bump the
version number and board said flight to the Pacific.

Paolo
Avi Kivity - Sept. 4, 2012, 10:44 a.m.
On 09/04/2012 01:06 PM, Paolo Bonzini wrote:
> Il 04/09/2012 11:51, Jan Kiszka ha scritto:
>>> > 
>>> > I don't mean to say we shouldn't care about them, but there are likely
>>> > to be a lot more users doing backwards migration than users running
>>> > those guests, let alone migrating them (forwards or backwards).  The
>>> > pragmatic choice is clear.
>> BTW, did anyone actually test backward migration recently? I thought to
>> remember I effectively broke it in 1.1 with some changes to the i8259
>> (or was it the PIT?) vmstate, and no one really cared about this or my
>> first proposals to fix it.
> 
> Correct: commit ce967e2 (i8254: Rework & fix interaction with HPET in
> legacy mode, 2012-02-01) bumped the PIT version from 2 to 3.  

Gah, this is 1.1, so there's no way to fix it.

> RTC
> changes will break it more in 1.3.
> 
> Honestly, backwards migration only works on "enterprise" qemu-kvm
> because it is tested only there.  And so far the only sample across
> major releases is that RHEL6->RHEL5 migration didn't work.

I expect that we will want 7->6, though we may have to settle for minor
releases only.

> Here the choice is between changing guest behavior by defaulting to 4/2,
> and always transmitting the subsection by defaulting to 0/0.  The latter
> makes the subsection useless, so at that point we might as well bump the
> version number and board said flight to the Pacific.

So we do the 4/2 thing.  I expect if we go they'll tell us they don't do
backwards flights.
Maciej W. Rozycki - Sept. 4, 2012, 2:29 p.m.
On Mon, 3 Sep 2012, Jan Kiszka wrote:

> >   - Qemu output (without this patch):
> >         elcr=0c00 cmdRead ummask mask sti irq15 unmask DONE
> > 
> > But on real hardware, the master seems to treat IRQ2 as level triggered,

 That is not universally true, however in reality it does not matter, more 
on that below.

> > and doesn't deliver an interrupt to the CPU until the slave unmasks IRQ14.
> > I've tried this on several machines, including a non-PCI 486 that
> > does not seem to have ELCR registers.
> > 
> >   - Typical output (pentium/pentium pro/dual AMD Athlon/etc; slight
> >     variations in some elcr bits depending on machine, but bit 0x04
> >     is always clear):
> >         elcr=0c20 cmdRead unmask mask sti unmask irq14 DONE
> > 
> >   - 486 without elcr (just an ISA bus):
> >         elcr=e0e0 cmdRead unmask mask sti unmask irq14 DONE
> > 
> >   - One failure: It doesn't boot properly (no output) with a USB floppy
> >     drive on my Intel Core I7.  Guess: The test program just barely
> >     fits in a sector, with no room for any tables (partition/etc)
> >     that BIOS might check for if it isn't an original, native floppy
> >     drive.
> > 
> > -----------------------
> > 
> > I've found a few descriptions of programming the i8259.
> 
> I was mostly following the official "8259A PROGRAMMABLE INTERRUPT
> CONTROLLER (8259A 8259A-2)" by Intel. But it also doesn't mention any
> trigger mode changes for the cascading input lines.

 The datasheet is vague on the subject and there is apparently a common 
misconception on how the 8259A trigger modes work.  PIC core 
modifications, especially the ELCR (originally added for EISA support), 
seen in chipsets obfuscate the matter further.

 So first of all, the *output* of the 8259A is always edge triggered, 
regardless of whether it's the master or one of the slaves (only one slave 
is used in the PC/AT architecture, but up to eight are supported; the 
PC/XT had none).  If once an interrupt has been served any interrupts 
remain in the pending state, then the 8259A deasserts its output briefly 
before reasserting it.

 Secondly, for inputs the original 8259A only allowed a global (i.e. for 
all at once) edge/level selection via ICW1.  That somehow had to be 
compatible with the master/slave mode.

 However as of the 8259A "edge-triggered" in Intel's nomenclature meant: 
"asserted with a low-to-high transition at the time an interrupt is 
registered and then kept high until the interrupt is served via one of the 
EOI mechanisms (there are a few, including AEOI) or goes away unhandled."  
This is important for the output because the x86 architecture's definition 
for the INT input is level-triggered.

 The edge-triggered mode was also the only one supported for inputs with 
the original 8080-compatible 8259; note that for inputs a high-to-low 
transition cancels the interrupt as in the level-triggered mode.  The 
"level-triggered" mode was then added with the usual meaning -- 
essentially copying the interrupt input onto the output, ORing with the 
other inputs.

 So thus defined "edge-triggered" mode is in fact compatible with both 
edge-triggered and level-triggered inputs -- the former work by means of 
observing the low-to-high transition and the latter also work by only 
seeing the interrupt go away briefly -- but that does not really matter 
for level-triggered interrupts as the 8259A still knows an IRQ is pending 
and if a race causes an INTA cycle to arrive at the time the PIC's output 
is low it will still correctly serve the vector (I think in practice it 
does not really happen as the vector is served with the second INTA cycle 
only and by then the interrupt output must have recovered).

 There were some interesting errata in some early embedded 8259A cores in 
EISA chipses where the brief deassertion of the output did not happen.  
That worked just fine when wired directly to an x86 processor (because as 
noted above the INT input is level-triggered), however when connected to 
an i82489DX discrete APIC, that interprets the edge-triggered mode in the 
traditional way and implies that mode for inputs cascaded to 8259A, the 
problem had to be worked around in hardware, by adding extra logic to 
regenerate the missing high-to-low-to-high transition.

 Some of the information presented here was only noted by Intel in APIC 
documentation; I may track down and quote references, but these documents 
have never been available online anyway.

 Did that help?

  Maciej

Patch

==========================================

 hw/i8259.c          | 12 ++++++++----
 hw/i8259_common.c   |  2 ++
 hw/i8259_internal.h |  1 +
 3 files changed, 11 insertions(+), 4 deletions(-)

diff --git a/hw/i8259.c b/hw/i8259.c
index 6587666..8e2f9f4 100644
--- a/hw/i8259.c
+++ b/hw/i8259.c
@@ -140,7 +140,7 @@  static void pic_set_irq(void *opaque, int irq, int level)
     }
 #endif
 
-    if (s->elcr & mask) {
+    if ((s->elcr | s->icw3) & mask) {
         /* level triggered */
         if (level) {
             s->irr |= mask;
@@ -174,7 +174,7 @@  static void pic_intack(PICCommonState *s, int irq)
         s->isr |= (1 << irq);
     }
     /* We don't clear a level sensitive interrupt here */
-    if (!(s->elcr & (1 << irq))) {
+    if (!((s->elcr | s->icw3) & (1 << irq))) {
         s->irr &= ~(1 << irq);
     }
     pic_update_irq(s);
@@ -314,6 +314,10 @@  static void pic_ioport_write(void *opaque, target_phys_addr_t addr64,
             s->init_state = s->single_mode ? (s->init4 ? 3 : 0) : 2;
             break;
         case 2:
+            s->icw3 = val;
+            /* TODO: Enforce constraints?: Master is typically
+             *   0x04 for IRQ2 (maybe 0 is also OK).  Slave must be 0.
+             */
             if (s->init4) {
                 s->init_state = 3;
             } else {
@@ -419,9 +423,9 @@  void pic_info(Monitor *mon)
     for (i = 0; i < 2; i++) {
         s = i == 0 ? DO_UPCAST(PICCommonState, dev.qdev, isa_pic) : slave_pic;
         monitor_printf(mon, "pic%d: irr=%02x imr=%02x isr=%02x hprio=%d "
-                       "irq_base=%02x rr_sel=%d elcr=%02x fnm=%d\n",
+                       "irq_base=%02x icw3=%02x rr_sel=%d elcr=%02x fnm=%d\n",
                        i, s->irr, s->imr, s->isr, s->priority_add,
-                       s->irq_base, s->read_reg_select, s->elcr,
+                       s->irq_base, s->icw3, s->read_reg_select, s->elcr,
                        s->special_fully_nested_mode);
     }
 }
diff --git a/hw/i8259_common.c b/hw/i8259_common.c
index ab3d98b..dcde5f2 100644
--- a/hw/i8259_common.c
+++ b/hw/i8259_common.c
@@ -33,6 +33,7 @@  void pic_reset_common(PICCommonState *s)
     s->isr = 0;
     s->priority_add = 0;
     s->irq_base = 0;
+    s->icw3 = 0;
     s->read_reg_select = 0;
     s->poll = 0;
     s->special_mask = 0;
@@ -111,6 +112,7 @@  static const VMStateDescription vmstate_pic_common = {
         VMSTATE_UINT8(isr, PICCommonState),
         VMSTATE_UINT8(priority_add, PICCommonState),
         VMSTATE_UINT8(irq_base, PICCommonState),
+        VMSTATE_UINT8(icw3, PICCommonState),
         VMSTATE_UINT8(read_reg_select, PICCommonState),
         VMSTATE_UINT8(poll, PICCommonState),
         VMSTATE_UINT8(special_mask, PICCommonState),
diff --git a/hw/i8259_internal.h b/hw/i8259_internal.h
index 4137b61..de67d49 100644
--- a/hw/i8259_internal.h
+++ b/hw/i8259_internal.h
@@ -55,6 +55,7 @@  struct PICCommonState {
     uint8_t isr; /* interrupt service register */
     uint8_t priority_add; /* highest irq priority */
     uint8_t irq_base;
+    uint8_t icw3; /* bit set if line is connected to a slave */
     uint8_t read_reg_select;
     uint8_t poll;
     uint8_t special_mask;