diff mbox

[1/2] KVM: fix i8259 interrupt high to low transition logic

Message ID 1347240563-6212-1-git-send-email-mmogilvi_qemu@miniinfo.net
State New
Headers show

Commit Message

Matthew Ogilvie Sept. 10, 2012, 1:29 a.m. UTC
Intel's definition of "edge triggered" means: "asserted with a
low-to-high transition at the time an interrupt is registered
and then kept high until the interrupt is served via one of the
EOI mechanisms or goes away unhandled."

So the only difference between edge triggered and level triggered
is in the leading edge, with no difference in the trailing edge.

This bug manifested itself when the guest was Microport UNIX
System V/386 v2.1 (ca. 1987), because it would sometimes mask
off IRQ14 in the slave IMR after it had already been asserted.
The master would still try to deliver an interrupt even though
IRQ2 had dropped again, resulting in a spurious interupt
(IRQ15) and a panicked UNIX kernel.

Signed-off-by: Matthew Ogilvie <mmogilvi_qemu@miniinfo.net>
---

There has been some discussions about this on the qemu mailing
list; for example see
http://www.mail-archive.com/qemu-devel@nongnu.org/msg129071.html

This fixes the test program I wrote to demonstrate the problem.  See
http://home.comcast.net/~mmogilvi/downloads/i8259-imr-test-2012-09-02.tar.bz2
(for source code to a floppy disk boot sector) or a kvm-unit-test patch
I'm posting separately.

Unfortunately, Microport UNIX still has some some unknown interrupt
problem when run under qemu with KVM enabled, although it
runs fine under plain QEMU [no KVM] with the corresponding
QEMU patch.  The unknown interrupt bug actually triggers
much earlier than this one: this one doesn't show up until
the boot floppy accesses the hard drive after some prompts,
but the unknown bug shows up early in bootup.  Also, the specific
function in which UNIX panics is different: this
one in splint(); the unknown bug in splx().  Also, the KVM
behavior is not affected by the presense or absense of this patch.

I'll probably continue looking into the unknown problem, but I'm not
sure when I'll have any results.

Despite this unknown problem, I'm confident that this known problem
should be fixed, and if not fixed will cause problems when I figure
out the unknown problem.

 arch/x86/kvm/i8254.c |  6 +++++-
 arch/x86/kvm/i8259.c | 14 ++++++--------
 2 files changed, 11 insertions(+), 9 deletions(-)

Comments

Maciej W. Rozycki Sept. 11, 2012, 12:49 a.m. UTC | #1
On Sun, 9 Sep 2012, Matthew Ogilvie wrote:

> This bug manifested itself when the guest was Microport UNIX
> System V/386 v2.1 (ca. 1987), because it would sometimes mask
> off IRQ14 in the slave IMR after it had already been asserted.
> The master would still try to deliver an interrupt even though
> IRQ2 had dropped again, resulting in a spurious interupt
> (IRQ15) and a panicked UNIX kernel.

 That is quite weird actually -- from my experience the spurious vector is 
never sent from a slave (quite understandably -- since the interrupt is 
gone and no other is pending, the master has no reason to select a slave 
to supply a vector and therefore supplies the spurious vector itself) and 
therefore a spurious IRQ7 is always issued regardless of whether the 
discarded request came from a slave or from the master.

 Is there a bug elsewhere then too?  I would have expected a reasonable 
(and especially an old-school) x86 OS to be able to cope with spurious 
8259A interrupts, but then obviously one would expect them on IRQ7 only.

  Maciej
Matthew Ogilvie Sept. 11, 2012, 4:54 a.m. UTC | #2
On Tue, Sep 11, 2012 at 01:49:51AM +0100, Maciej W. Rozycki wrote:
> On Sun, 9 Sep 2012, Matthew Ogilvie wrote:
> 
> > This bug manifested itself when the guest was Microport UNIX
> > System V/386 v2.1 (ca. 1987), because it would sometimes mask
> > off IRQ14 in the slave IMR after it had already been asserted.
> > The master would still try to deliver an interrupt even though
> > IRQ2 had dropped again, resulting in a spurious interupt
> > (IRQ15) and a panicked UNIX kernel.
> 
>  That is quite weird actually -- from my experience the spurious vector is 
> never sent from a slave (quite understandably -- since the interrupt is 
> gone and no other is pending, the master has no reason to select a slave 
> to supply a vector and therefore supplies the spurious vector itself) and 
> therefore a spurious IRQ7 is always issued regardless of whether the 
> discarded request came from a slave or from the master.

Keep in mind that this paragraph is describing QEMU's 8259 device
model behavior (and also KVM's), not real hardware.  Reading the
unpatched code, the master clearly latches on to the momentary IRQ2,
does not cancel it when it is cleared again, and ultimately delivers
a spurious IRQ15.

As for what the OS is doing with the IRQ15 (or IRQ7), I only have a large
dissamebly listing (with only a vague idea of it's overall interrupt
handling strategy), and some printf logs of stuff happening in the
8259 model when the OS is running (more useful).

> 
>  Is there a bug elsewhere then too?  I would have expected a reasonable 
> (and especially an old-school) x86 OS to be able to cope with spurious 
> 8259A interrupts, but then obviously one would expect them on IRQ7 only.
> 
>   Maciej
Jan Kiszka Sept. 11, 2012, 9:04 a.m. UTC | #3
On 2012-09-11 02:49, Maciej W. Rozycki wrote:
> On Sun, 9 Sep 2012, Matthew Ogilvie wrote:
> 
>> This bug manifested itself when the guest was Microport UNIX
>> System V/386 v2.1 (ca. 1987), because it would sometimes mask
>> off IRQ14 in the slave IMR after it had already been asserted.
>> The master would still try to deliver an interrupt even though
>> IRQ2 had dropped again, resulting in a spurious interupt
>> (IRQ15) and a panicked UNIX kernel.
> 
>  That is quite weird actually -- from my experience the spurious vector is 
> never sent from a slave (quite understandably -- since the interrupt is 
> gone and no other is pending, the master has no reason to select a slave 
> to supply a vector and therefore supplies the spurious vector itself) and 
> therefore a spurious IRQ7 is always issued regardless of whether the 
> discarded request came from a slave or from the master.

As we do not clear IRQ14 in IRR of the slave nor do we clear IRQ2 of the
master, the master has a good reason to ask the slave for the vector.

Jan
Maciej W. Rozycki Sept. 11, 2012, 11:53 a.m. UTC | #4
On Mon, 10 Sep 2012, Matthew Ogilvie wrote:

> > > This bug manifested itself when the guest was Microport UNIX
> > > System V/386 v2.1 (ca. 1987), because it would sometimes mask
> > > off IRQ14 in the slave IMR after it had already been asserted.
> > > The master would still try to deliver an interrupt even though
> > > IRQ2 had dropped again, resulting in a spurious interupt
> > > (IRQ15) and a panicked UNIX kernel.
> > 
> >  That is quite weird actually -- from my experience the spurious vector is 
> > never sent from a slave (quite understandably -- since the interrupt is 
> > gone and no other is pending, the master has no reason to select a slave 
> > to supply a vector and therefore supplies the spurious vector itself) and 
> > therefore a spurious IRQ7 is always issued regardless of whether the 
> > discarded request came from a slave or from the master.
> 
> Keep in mind that this paragraph is describing QEMU's 8259 device
> model behavior (and also KVM's), not real hardware.  Reading the
> unpatched code, the master clearly latches on to the momentary IRQ2,
> does not cancel it when it is cleared again, and ultimately delivers
> a spurious IRQ15.

 Well, it is your software model I am writing about.  IIRC either 
(according to your previous understanding of the edge trigger mode) the 
master should latch IRQ2 and the slave IRQ14 both at a time until 
ackonwledged or both should (correctly) let it go.  So, depending on the 
model implemented, you should see either IRQ14 or IRQ7 delivered, but 
never IRQ15.  It does not make sense to me when you latch the cascade 
input in the master but no corresponding actual input in the slave, the 
chips are symmetrical.

 Anyway I infer you have corrected the model now and as a side effect no 
spurious IRQ15 is going to be delivered ever, right?

  Maciej
Avi Kivity Sept. 12, 2012, 8:01 a.m. UTC | #5
On 09/10/2012 04:29 AM, Matthew Ogilvie wrote:
> Intel's definition of "edge triggered" means: "asserted with a
> low-to-high transition at the time an interrupt is registered
> and then kept high until the interrupt is served via one of the
> EOI mechanisms or goes away unhandled."
> 
> So the only difference between edge triggered and level triggered
> is in the leading edge, with no difference in the trailing edge.
> 
> This bug manifested itself when the guest was Microport UNIX
> System V/386 v2.1 (ca. 1987), because it would sometimes mask
> off IRQ14 in the slave IMR after it had already been asserted.
> The master would still try to deliver an interrupt even though
> IRQ2 had dropped again, resulting in a spurious interupt
> (IRQ15) and a panicked UNIX kernel.
> diff --git a/arch/x86/kvm/i8254.c b/arch/x86/kvm/i8254.c
> index adba28f..5cbba99 100644
> --- a/arch/x86/kvm/i8254.c
> +++ b/arch/x86/kvm/i8254.c
> @@ -302,8 +302,12 @@ static void pit_do_work(struct kthread_work *work)
>  	}
>  	spin_unlock(&ps->inject_lock);
>  	if (inject) {
> -		kvm_set_irq(kvm, kvm->arch.vpit->irq_source_id, 0, 1);
> +		/* Clear previous interrupt, then create a rising
> +		 * edge to request another interupt, and leave it at
> +		 * level=1 until time to inject another one.
> +		 */
>  		kvm_set_irq(kvm, kvm->arch.vpit->irq_source_id, 0, 0);
> +		kvm_set_irq(kvm, kvm->arch.vpit->irq_source_id, 0, 1);
>  
>  		/*

I thought I understood this, now I'm not sure.  How can this be correct?
 Real hardware doesn't act like this.

What if the PIT is disabled after this?  You're injecting a spurious
interrupt then.
Jan Kiszka Sept. 12, 2012, 8:48 a.m. UTC | #6
On 2012-09-12 10:01, Avi Kivity wrote:
> On 09/10/2012 04:29 AM, Matthew Ogilvie wrote:
>> Intel's definition of "edge triggered" means: "asserted with a
>> low-to-high transition at the time an interrupt is registered
>> and then kept high until the interrupt is served via one of the
>> EOI mechanisms or goes away unhandled."
>>
>> So the only difference between edge triggered and level triggered
>> is in the leading edge, with no difference in the trailing edge.
>>
>> This bug manifested itself when the guest was Microport UNIX
>> System V/386 v2.1 (ca. 1987), because it would sometimes mask
>> off IRQ14 in the slave IMR after it had already been asserted.
>> The master would still try to deliver an interrupt even though
>> IRQ2 had dropped again, resulting in a spurious interupt
>> (IRQ15) and a panicked UNIX kernel.
>> diff --git a/arch/x86/kvm/i8254.c b/arch/x86/kvm/i8254.c
>> index adba28f..5cbba99 100644
>> --- a/arch/x86/kvm/i8254.c
>> +++ b/arch/x86/kvm/i8254.c
>> @@ -302,8 +302,12 @@ static void pit_do_work(struct kthread_work *work)
>>  	}
>>  	spin_unlock(&ps->inject_lock);
>>  	if (inject) {
>> -		kvm_set_irq(kvm, kvm->arch.vpit->irq_source_id, 0, 1);
>> +		/* Clear previous interrupt, then create a rising
>> +		 * edge to request another interupt, and leave it at
>> +		 * level=1 until time to inject another one.
>> +		 */
>>  		kvm_set_irq(kvm, kvm->arch.vpit->irq_source_id, 0, 0);
>> +		kvm_set_irq(kvm, kvm->arch.vpit->irq_source_id, 0, 1);
>>  
>>  		/*
> 
> I thought I understood this, now I'm not sure.  How can this be correct?
>  Real hardware doesn't act like this.
> 
> What if the PIT is disabled after this?  You're injecting a spurious
> interrupt then.

Yes, the PIT has to raise the output as long as specified, i.e.
according to the datasheet. That's important now due to the corrections
to the PIC. We can then carefully check if there is room for
simplifications / optimizations. I also cannot imagine that the above
already fulfills these requirements.

And if the PIT is disabled by the HPET, we need to clear the output
explicitly as we inject the IRQ#0 under a different source ID than
userspace HPET does (which will logically take over IRQ#0 control). The
kernel would otherwise OR both sources to an incorrect result.

Jan
Avi Kivity Sept. 12, 2012, 8:51 a.m. UTC | #7
On 09/12/2012 11:48 AM, Jan Kiszka wrote:
> On 2012-09-12 10:01, Avi Kivity wrote:
>> On 09/10/2012 04:29 AM, Matthew Ogilvie wrote:
>>> Intel's definition of "edge triggered" means: "asserted with a
>>> low-to-high transition at the time an interrupt is registered
>>> and then kept high until the interrupt is served via one of the
>>> EOI mechanisms or goes away unhandled."
>>>
>>> So the only difference between edge triggered and level triggered
>>> is in the leading edge, with no difference in the trailing edge.
>>>
>>> This bug manifested itself when the guest was Microport UNIX
>>> System V/386 v2.1 (ca. 1987), because it would sometimes mask
>>> off IRQ14 in the slave IMR after it had already been asserted.
>>> The master would still try to deliver an interrupt even though
>>> IRQ2 had dropped again, resulting in a spurious interupt
>>> (IRQ15) and a panicked UNIX kernel.
>>> diff --git a/arch/x86/kvm/i8254.c b/arch/x86/kvm/i8254.c
>>> index adba28f..5cbba99 100644
>>> --- a/arch/x86/kvm/i8254.c
>>> +++ b/arch/x86/kvm/i8254.c
>>> @@ -302,8 +302,12 @@ static void pit_do_work(struct kthread_work *work)
>>>  	}
>>>  	spin_unlock(&ps->inject_lock);
>>>  	if (inject) {
>>> -		kvm_set_irq(kvm, kvm->arch.vpit->irq_source_id, 0, 1);
>>> +		/* Clear previous interrupt, then create a rising
>>> +		 * edge to request another interupt, and leave it at
>>> +		 * level=1 until time to inject another one.
>>> +		 */
>>>  		kvm_set_irq(kvm, kvm->arch.vpit->irq_source_id, 0, 0);
>>> +		kvm_set_irq(kvm, kvm->arch.vpit->irq_source_id, 0, 1);
>>>  
>>>  		/*
>> 
>> I thought I understood this, now I'm not sure.  How can this be correct?
>>  Real hardware doesn't act like this.
>> 
>> What if the PIT is disabled after this?  You're injecting a spurious
>> interrupt then.
> 
> Yes, the PIT has to raise the output as long as specified, i.e.
> according to the datasheet. That's important now due to the corrections
> to the PIC. We can then carefully check if there is room for
> simplifications / optimizations. I also cannot imagine that the above
> already fulfills these requirements.
> 
> And if the PIT is disabled by the HPET, we need to clear the output
> explicitly as we inject the IRQ#0 under a different source ID than
> userspace HPET does (which will logically take over IRQ#0 control). The
> kernel would otherwise OR both sources to an incorrect result.
> 

I guess we need to double the hrtimer rate then in order to generate a
square wave.  It's getting ridiculous how accurate our model needs to be.
Jan Kiszka Sept. 12, 2012, 8:57 a.m. UTC | #8
On 2012-09-12 10:51, Avi Kivity wrote:
> On 09/12/2012 11:48 AM, Jan Kiszka wrote:
>> On 2012-09-12 10:01, Avi Kivity wrote:
>>> On 09/10/2012 04:29 AM, Matthew Ogilvie wrote:
>>>> Intel's definition of "edge triggered" means: "asserted with a
>>>> low-to-high transition at the time an interrupt is registered
>>>> and then kept high until the interrupt is served via one of the
>>>> EOI mechanisms or goes away unhandled."
>>>>
>>>> So the only difference between edge triggered and level triggered
>>>> is in the leading edge, with no difference in the trailing edge.
>>>>
>>>> This bug manifested itself when the guest was Microport UNIX
>>>> System V/386 v2.1 (ca. 1987), because it would sometimes mask
>>>> off IRQ14 in the slave IMR after it had already been asserted.
>>>> The master would still try to deliver an interrupt even though
>>>> IRQ2 had dropped again, resulting in a spurious interupt
>>>> (IRQ15) and a panicked UNIX kernel.
>>>> diff --git a/arch/x86/kvm/i8254.c b/arch/x86/kvm/i8254.c
>>>> index adba28f..5cbba99 100644
>>>> --- a/arch/x86/kvm/i8254.c
>>>> +++ b/arch/x86/kvm/i8254.c
>>>> @@ -302,8 +302,12 @@ static void pit_do_work(struct kthread_work *work)
>>>>  	}
>>>>  	spin_unlock(&ps->inject_lock);
>>>>  	if (inject) {
>>>> -		kvm_set_irq(kvm, kvm->arch.vpit->irq_source_id, 0, 1);
>>>> +		/* Clear previous interrupt, then create a rising
>>>> +		 * edge to request another interupt, and leave it at
>>>> +		 * level=1 until time to inject another one.
>>>> +		 */
>>>>  		kvm_set_irq(kvm, kvm->arch.vpit->irq_source_id, 0, 0);
>>>> +		kvm_set_irq(kvm, kvm->arch.vpit->irq_source_id, 0, 1);
>>>>  
>>>>  		/*
>>>
>>> I thought I understood this, now I'm not sure.  How can this be correct?
>>>  Real hardware doesn't act like this.
>>>
>>> What if the PIT is disabled after this?  You're injecting a spurious
>>> interrupt then.
>>
>> Yes, the PIT has to raise the output as long as specified, i.e.
>> according to the datasheet. That's important now due to the corrections
>> to the PIC. We can then carefully check if there is room for
>> simplifications / optimizations. I also cannot imagine that the above
>> already fulfills these requirements.
>>
>> And if the PIT is disabled by the HPET, we need to clear the output
>> explicitly as we inject the IRQ#0 under a different source ID than
>> userspace HPET does (which will logically take over IRQ#0 control). The
>> kernel would otherwise OR both sources to an incorrect result.
>>
> 
> I guess we need to double the hrtimer rate then in order to generate a
> square wave.  It's getting ridiculous how accurate our model needs to be.

I would suggest to solve this for the userspace model first, ensure that
it works properly in all modes, maybe optimize it, and then decide how
to map all this on kernel space. As long as we have two models, we can
also make use of them.

Jan
Avi Kivity Sept. 12, 2012, 9:02 a.m. UTC | #9
On 09/12/2012 11:57 AM, Jan Kiszka wrote:
> On 2012-09-12 10:51, Avi Kivity wrote:
>> On 09/12/2012 11:48 AM, Jan Kiszka wrote:
>>> On 2012-09-12 10:01, Avi Kivity wrote:
>>>> On 09/10/2012 04:29 AM, Matthew Ogilvie wrote:
>>>>> Intel's definition of "edge triggered" means: "asserted with a
>>>>> low-to-high transition at the time an interrupt is registered
>>>>> and then kept high until the interrupt is served via one of the
>>>>> EOI mechanisms or goes away unhandled."
>>>>>
>>>>> So the only difference between edge triggered and level triggered
>>>>> is in the leading edge, with no difference in the trailing edge.
>>>>>
>>>>> This bug manifested itself when the guest was Microport UNIX
>>>>> System V/386 v2.1 (ca. 1987), because it would sometimes mask
>>>>> off IRQ14 in the slave IMR after it had already been asserted.
>>>>> The master would still try to deliver an interrupt even though
>>>>> IRQ2 had dropped again, resulting in a spurious interupt
>>>>> (IRQ15) and a panicked UNIX kernel.
>>>>> diff --git a/arch/x86/kvm/i8254.c b/arch/x86/kvm/i8254.c
>>>>> index adba28f..5cbba99 100644
>>>>> --- a/arch/x86/kvm/i8254.c
>>>>> +++ b/arch/x86/kvm/i8254.c
>>>>> @@ -302,8 +302,12 @@ static void pit_do_work(struct kthread_work *work)
>>>>>  	}
>>>>>  	spin_unlock(&ps->inject_lock);
>>>>>  	if (inject) {
>>>>> -		kvm_set_irq(kvm, kvm->arch.vpit->irq_source_id, 0, 1);
>>>>> +		/* Clear previous interrupt, then create a rising
>>>>> +		 * edge to request another interupt, and leave it at
>>>>> +		 * level=1 until time to inject another one.
>>>>> +		 */
>>>>>  		kvm_set_irq(kvm, kvm->arch.vpit->irq_source_id, 0, 0);
>>>>> +		kvm_set_irq(kvm, kvm->arch.vpit->irq_source_id, 0, 1);
>>>>>  
>>>>>  		/*
>>>>
>>>> I thought I understood this, now I'm not sure.  How can this be correct?
>>>>  Real hardware doesn't act like this.
>>>>
>>>> What if the PIT is disabled after this?  You're injecting a spurious
>>>> interrupt then.
>>>
>>> Yes, the PIT has to raise the output as long as specified, i.e.
>>> according to the datasheet. That's important now due to the corrections
>>> to the PIC. We can then carefully check if there is room for
>>> simplifications / optimizations. I also cannot imagine that the above
>>> already fulfills these requirements.
>>>
>>> And if the PIT is disabled by the HPET, we need to clear the output
>>> explicitly as we inject the IRQ#0 under a different source ID than
>>> userspace HPET does (which will logically take over IRQ#0 control). The
>>> kernel would otherwise OR both sources to an incorrect result.
>>>
>> 
>> I guess we need to double the hrtimer rate then in order to generate a
>> square wave.  It's getting ridiculous how accurate our model needs to be.
> 
> I would suggest to solve this for the userspace model first, ensure that
> it works properly in all modes, maybe optimize it, and then decide how
> to map all this on kernel space. As long as we have two models, we can
> also make use of them.

Good idea.
Matthew Ogilvie Sept. 13, 2012, 5:49 a.m. UTC | #10
On Wed, Sep 12, 2012 at 10:57:57AM +0200, Jan Kiszka wrote:
> On 2012-09-12 10:51, Avi Kivity wrote:
> > On 09/12/2012 11:48 AM, Jan Kiszka wrote:
> >> On 2012-09-12 10:01, Avi Kivity wrote:
> >>> On 09/10/2012 04:29 AM, Matthew Ogilvie wrote:
> >>>> Intel's definition of "edge triggered" means: "asserted with a
> >>>> low-to-high transition at the time an interrupt is registered
> >>>> and then kept high until the interrupt is served via one of the
> >>>> EOI mechanisms or goes away unhandled."
> >>>>
> >>>> So the only difference between edge triggered and level triggered
> >>>> is in the leading edge, with no difference in the trailing edge.
> >>>>
> >>>> This bug manifested itself when the guest was Microport UNIX
> >>>> System V/386 v2.1 (ca. 1987), because it would sometimes mask
> >>>> off IRQ14 in the slave IMR after it had already been asserted.
> >>>> The master would still try to deliver an interrupt even though
> >>>> IRQ2 had dropped again, resulting in a spurious interupt
> >>>> (IRQ15) and a panicked UNIX kernel.
> >>>> diff --git a/arch/x86/kvm/i8254.c b/arch/x86/kvm/i8254.c
> >>>> index adba28f..5cbba99 100644
> >>>> --- a/arch/x86/kvm/i8254.c
> >>>> +++ b/arch/x86/kvm/i8254.c
> >>>> @@ -302,8 +302,12 @@ static void pit_do_work(struct kthread_work *work)
> >>>>  	}
> >>>>  	spin_unlock(&ps->inject_lock);
> >>>>  	if (inject) {
> >>>> -		kvm_set_irq(kvm, kvm->arch.vpit->irq_source_id, 0, 1);
> >>>> +		/* Clear previous interrupt, then create a rising
> >>>> +		 * edge to request another interupt, and leave it at
> >>>> +		 * level=1 until time to inject another one.
> >>>> +		 */
> >>>>  		kvm_set_irq(kvm, kvm->arch.vpit->irq_source_id, 0, 0);
> >>>> +		kvm_set_irq(kvm, kvm->arch.vpit->irq_source_id, 0, 1);
> >>>>  
> >>>>  		/*
> >>>
> >>> I thought I understood this, now I'm not sure.  How can this be correct?
> >>>  Real hardware doesn't act like this.
> >>>
> >>> What if the PIT is disabled after this?  You're injecting a spurious
> >>> interrupt then.
> >>
> >> Yes, the PIT has to raise the output as long as specified, i.e.
> >> according to the datasheet. That's important now due to the corrections
> >> to the PIC. We can then carefully check if there is room for
> >> simplifications / optimizations. I also cannot imagine that the above
> >> already fulfills these requirements.
> >>
> >> And if the PIT is disabled by the HPET, we need to clear the output
> >> explicitly as we inject the IRQ#0 under a different source ID than
> >> userspace HPET does (which will logically take over IRQ#0 control). The
> >> kernel would otherwise OR both sources to an incorrect result.
> >>
> > 
> > I guess we need to double the hrtimer rate then in order to generate a
> > square wave.  It's getting ridiculous how accurate our model needs to be.
> 
> I would suggest to solve this for the userspace model first, ensure that
> it works properly in all modes, maybe optimize it, and then decide how
> to map all this on kernel space. As long as we have two models, we can
> also make use of them.

Thoughts about the 8254 PIT:

First, this summary of (real) 8254 PIT behavior seems fairly
good, as far it goes:

On Tue, Sep 04, 2012 at 07:27:38PM +0100, Maciej W. Rozycki wrote:
>  * The 8254 PIT is normally configured in mode 2 or 3 in the PC/AT
>    architecture.  In the former its output is high (active) all the time
>    except from one (last) clock cycle.  In the latter a wave that has a
>    duty cycle close or equal to 0.5 (depending on whether the divider is
>    odd or even) is produced, so no short pulses either.  I don't remember
>    the other four modes -- have a look at the datasheet if interested, but
>    I reckon they're not really compatible with the wiring anyway, e.g. the
>    gate is hardwired enabled.

I've also just skimmed parts of the 8254 section of "The Indispensable PC
Hardware Book", by Hans-Peter Messmer, Copyright 1994 Addison-Wesley,
although I probably ought to read it more carefully.

Under "normal" conditions, the 8254 part of the patch above should be
indistinguishable from previous behavior.  The 8259's IRR will
still show up as 1 until the interrupt is actually serviced,
and no new interrupt will be serviced after one is serviced until
another edge is injected via the high-low-high transition of the new
code.  (Unless the guest resets the 8259 or maybe messes with IMR,
but real hardware would generate extra interrupts in such cases as
well.)

The new code sounds much closer to mode 2 described by
Maciej, compared to the old code - except the duty cycle is
effectively 100 percent instead of 99.[some number of 9's] percent.

-----------------
But there might be some concerns in abnormal conditions:

   * If some guest is actually depending on a 50 percent duty cycle
     (maybe some kind of polling rather than interrupts), I would
     expect it to be just as broken before this patch as after,
     unless it is really weird (handles continuous 1's more
     gracefully than continuous 0's).

        According to the my book, mode 3 isn't normally
     used to create interrupts: "The generated square-wave signal
     can be used, for example, to trasmit data via serial interfaces.
     The PIT then operates as a baud rate generator.  In the PC,
     counters 1 and 2 are operated in mode 3 to drive memory refresh
     and the speaker, respectively." (page 369)

        I wouldn't be inclined to worry about the 50 percent duty
     cycle too much unless someone comes up with a real guest OS
     that depends on it.

   * To be correct there are probably some cases when the 8254 should
     force the IRQ0 line when the guest is setting up the 8254.  Based on
     a very quick reading of some of the 8254 section of my book:

      * Note that modes 0, 2, 3 and 4 look usable with a hard-wired GATE=1,
        but not modes 1 or 5.
      * Maybe force IRQ0=0 when the 8254 is disabled (however that is done;
        I haven't found it yet)?
      * Force IRQ0=0 when starting the timer in "mode 0" (one-shot).
      * Modes 2, 3, and 4 all apparently start with IRQ0=1.  I guess
        they generate an interrupt when first enabled?
         * Mode 2 (periodic) has a 99 percent duty cycle (high).
         * Mode 3 (periodic) a 50 percent duty cycle (see above).
         * Mode 4 (one-shot) is distinguished from mode 0 in that
           it generates both a high-low and low-high transition when
           it expires, instead of just a low-high.

   * I don't know anything about the HPET.  I didn't even realize
     it re-uses IRQ0.

   * If you back-migrate from after this change to before this change,
     maybe there is a risk that it will lose one timer interrupt?
     Forward migration shouldn't have an issue (the first trailing
     edge in the new code becomes a nop).  Perhaps hack it to encourage
     extra interrupts in some migration cases rather than lost interrupts
     in others?  Some kind of subsection thing like was being
     discussed for the 8259 earlier?  Or:

Also, how big of a concern is a very rare gained or lost IRQ0
actually?  Under normal conditions, I would expect this to at most
cause a one time clock drift in the guest OS of a fraction of
a second.  If that only happens when rebooting or migrating the
guest...

On the other hand, lost or gained interrupts might be a more
serious concern (especially if lost) if the 8254 is operating
in a one-shot mode (0 or 4): Something in the guest doesn't
stop (hangs) if not canceled by the interrupt.


Can anyone confirm or contradict any of this?  Other thoughts?

                - Matthew
Maciej W. Rozycki Sept. 13, 2012, 1:41 p.m. UTC | #11
On Wed, 12 Sep 2012, Matthew Ogilvie wrote:

> Also, how big of a concern is a very rare gained or lost IRQ0
> actually?  Under normal conditions, I would expect this to at most
> cause a one time clock drift in the guest OS of a fraction of
> a second.  If that only happens when rebooting or migrating the
> guest...

 It depends on how you define "very rare".  Once per month or probably 
even per day is probably acceptable although you'll see a disruption in 
the system clock.  This is still likely unwanted if the system is used as 
a clock reference and not just wants to keep its clock right for own 
purposes.  Anything more frequent and NTP does care very much; an accurate 
system clock is important in many uses, starting from basic ones such as 
where timestamps of files exported over NFS are concerned.

 Speaking of real hw -- I don't know whether that really matters for 
emulated systems.  Thanks for looking into the 8254 PIT in details.

  Maciej
Jan Kiszka Sept. 13, 2012, 1:49 p.m. UTC | #12
On 2012-09-13 15:41, Maciej W. Rozycki wrote:
> On Wed, 12 Sep 2012, Matthew Ogilvie wrote:
> 
>> Also, how big of a concern is a very rare gained or lost IRQ0
>> actually?  Under normal conditions, I would expect this to at most
>> cause a one time clock drift in the guest OS of a fraction of
>> a second.  If that only happens when rebooting or migrating the
>> guest...
> 
>  It depends on how you define "very rare".  Once per month or probably 
> even per day is probably acceptable although you'll see a disruption in 
> the system clock.  This is still likely unwanted if the system is used as 
> a clock reference and not just wants to keep its clock right for own 
> purposes.  Anything more frequent and NTP does care very much; an accurate 
> system clock is important in many uses, starting from basic ones such as 
> where timestamps of files exported over NFS are concerned.
> 
>  Speaking of real hw -- I don't know whether that really matters for 
> emulated systems.  Thanks for looking into the 8254 PIT in details.

First correct, then fast. That rule applies at least to the conceptual
phase. Also, for rarely used PIT modes, I would refrain from optimizing
them away from the specified behaviour.

Jan
Jan Kiszka Sept. 13, 2012, 1:55 p.m. UTC | #13
On 2012-09-13 07:49, Matthew Ogilvie wrote:
> On Wed, Sep 12, 2012 at 10:57:57AM +0200, Jan Kiszka wrote:
>> On 2012-09-12 10:51, Avi Kivity wrote:
>>> On 09/12/2012 11:48 AM, Jan Kiszka wrote:
>>>> On 2012-09-12 10:01, Avi Kivity wrote:
>>>>> On 09/10/2012 04:29 AM, Matthew Ogilvie wrote:
>>>>>> Intel's definition of "edge triggered" means: "asserted with a
>>>>>> low-to-high transition at the time an interrupt is registered
>>>>>> and then kept high until the interrupt is served via one of the
>>>>>> EOI mechanisms or goes away unhandled."
>>>>>>
>>>>>> So the only difference between edge triggered and level triggered
>>>>>> is in the leading edge, with no difference in the trailing edge.
>>>>>>
>>>>>> This bug manifested itself when the guest was Microport UNIX
>>>>>> System V/386 v2.1 (ca. 1987), because it would sometimes mask
>>>>>> off IRQ14 in the slave IMR after it had already been asserted.
>>>>>> The master would still try to deliver an interrupt even though
>>>>>> IRQ2 had dropped again, resulting in a spurious interupt
>>>>>> (IRQ15) and a panicked UNIX kernel.
>>>>>> diff --git a/arch/x86/kvm/i8254.c b/arch/x86/kvm/i8254.c
>>>>>> index adba28f..5cbba99 100644
>>>>>> --- a/arch/x86/kvm/i8254.c
>>>>>> +++ b/arch/x86/kvm/i8254.c
>>>>>> @@ -302,8 +302,12 @@ static void pit_do_work(struct kthread_work *work)
>>>>>>  	}
>>>>>>  	spin_unlock(&ps->inject_lock);
>>>>>>  	if (inject) {
>>>>>> -		kvm_set_irq(kvm, kvm->arch.vpit->irq_source_id, 0, 1);
>>>>>> +		/* Clear previous interrupt, then create a rising
>>>>>> +		 * edge to request another interupt, and leave it at
>>>>>> +		 * level=1 until time to inject another one.
>>>>>> +		 */
>>>>>>  		kvm_set_irq(kvm, kvm->arch.vpit->irq_source_id, 0, 0);
>>>>>> +		kvm_set_irq(kvm, kvm->arch.vpit->irq_source_id, 0, 1);
>>>>>>  
>>>>>>  		/*
>>>>>
>>>>> I thought I understood this, now I'm not sure.  How can this be correct?
>>>>>  Real hardware doesn't act like this.
>>>>>
>>>>> What if the PIT is disabled after this?  You're injecting a spurious
>>>>> interrupt then.
>>>>
>>>> Yes, the PIT has to raise the output as long as specified, i.e.
>>>> according to the datasheet. That's important now due to the corrections
>>>> to the PIC. We can then carefully check if there is room for
>>>> simplifications / optimizations. I also cannot imagine that the above
>>>> already fulfills these requirements.
>>>>
>>>> And if the PIT is disabled by the HPET, we need to clear the output
>>>> explicitly as we inject the IRQ#0 under a different source ID than
>>>> userspace HPET does (which will logically take over IRQ#0 control). The
>>>> kernel would otherwise OR both sources to an incorrect result.
>>>>
>>>
>>> I guess we need to double the hrtimer rate then in order to generate a
>>> square wave.  It's getting ridiculous how accurate our model needs to be.
>>
>> I would suggest to solve this for the userspace model first, ensure that
>> it works properly in all modes, maybe optimize it, and then decide how
>> to map all this on kernel space. As long as we have two models, we can
>> also make use of them.
> 
> Thoughts about the 8254 PIT:
> 
> First, this summary of (real) 8254 PIT behavior seems fairly
> good, as far it goes:
> 
> On Tue, Sep 04, 2012 at 07:27:38PM +0100, Maciej W. Rozycki wrote:
>>  * The 8254 PIT is normally configured in mode 2 or 3 in the PC/AT
>>    architecture.  In the former its output is high (active) all the time
>>    except from one (last) clock cycle.  In the latter a wave that has a
>>    duty cycle close or equal to 0.5 (depending on whether the divider is
>>    odd or even) is produced, so no short pulses either.  I don't remember
>>    the other four modes -- have a look at the datasheet if interested, but
>>    I reckon they're not really compatible with the wiring anyway, e.g. the
>>    gate is hardwired enabled.
> 
> I've also just skimmed parts of the 8254 section of "The Indispensable PC
> Hardware Book", by Hans-Peter Messmer, Copyright 1994 Addison-Wesley,
> although I probably ought to read it more carefully.

http://download.intel.com/design/archives/periphrl/docs/23124406.pdf
should be the primary reference - as long as it leaves no open questions.

> 
> Under "normal" conditions, the 8254 part of the patch above should be
> indistinguishable from previous behavior.  The 8259's IRR will
> still show up as 1 until the interrupt is actually serviced,
> and no new interrupt will be serviced after one is serviced until
> another edge is injected via the high-low-high transition of the new
> code.  (Unless the guest resets the 8259 or maybe messes with IMR,
> but real hardware would generate extra interrupts in such cases as
> well.)
> 
> The new code sounds much closer to mode 2 described by
> Maciej, compared to the old code - except the duty cycle is
> effectively 100 percent instead of 99.[some number of 9's] percent.
> 
> -----------------
> But there might be some concerns in abnormal conditions:
> 
>    * If some guest is actually depending on a 50 percent duty cycle
>      (maybe some kind of polling rather than interrupts), I would
>      expect it to be just as broken before this patch as after,
>      unless it is really weird (handles continuous 1's more
>      gracefully than continuous 0's).
> 
>         According to the my book, mode 3 isn't normally
>      used to create interrupts: "The generated square-wave signal
>      can be used, for example, to trasmit data via serial interfaces.
>      The PIT then operates as a baud rate generator.  In the PC,
>      counters 1 and 2 are operated in mode 3 to drive memory refresh
>      and the speaker, respectively." (page 369)
> 
>         I wouldn't be inclined to worry about the 50 percent duty
>      cycle too much unless someone comes up with a real guest OS
>      that depends on it.
> 
>    * To be correct there are probably some cases when the 8254 should
>      force the IRQ0 line when the guest is setting up the 8254.  Based on
>      a very quick reading of some of the 8254 section of my book:
> 
>       * Note that modes 0, 2, 3 and 4 look usable with a hard-wired GATE=1,
>         but not modes 1 or 5.
>       * Maybe force IRQ0=0 when the 8254 is disabled (however that is done;
>         I haven't found it yet)?
>       * Force IRQ0=0 when starting the timer in "mode 0" (one-shot).
>       * Modes 2, 3, and 4 all apparently start with IRQ0=1.  I guess
>         they generate an interrupt when first enabled?
>          * Mode 2 (periodic) has a 99 percent duty cycle (high).
>          * Mode 3 (periodic) a 50 percent duty cycle (see above).
>          * Mode 4 (one-shot) is distinguished from mode 0 in that
>            it generates both a high-low and low-high transition when
>            it expires, instead of just a low-high.
> 
>    * I don't know anything about the HPET.  I didn't even realize
>      it re-uses IRQ0.

In modern chipsets, the HPET can drive IRQ0 instead of the PIT. A muxer
selects the input for that purpose. And we disable the timer of the PIT
in that case to avoid needless background activity. See e.g.
pit_irq_control().

> 
>    * If you back-migrate from after this change to before this change,
>      maybe there is a risk that it will lose one timer interrupt?
>      Forward migration shouldn't have an issue (the first trailing
>      edge in the new code becomes a nop).  Perhaps hack it to encourage
>      extra interrupts in some migration cases rather than lost interrupts
>      in others?  Some kind of subsection thing like was being
>      discussed for the 8259 earlier?  Or:
> 
> Also, how big of a concern is a very rare gained or lost IRQ0
> actually?  Under normal conditions, I would expect this to at most
> cause a one time clock drift in the guest OS of a fraction of
> a second.  If that only happens when rebooting or migrating the
> guest...
> 
> On the other hand, lost or gained interrupts might be a more
> serious concern (especially if lost) if the 8254 is operating
> in a one-shot mode (0 or 4): Something in the guest doesn't
> stop (hangs) if not canceled by the interrupt.
> 
> 
> Can anyone confirm or contradict any of this?  Other thoughts?

See my other reply: Until we really understand the impact of some
optimization / simplification, we are forced to do it accurately.

Jan
Maciej W. Rozycki Sept. 13, 2012, 3:48 p.m. UTC | #14
On Thu, 13 Sep 2012, Jan Kiszka wrote:

> > I've also just skimmed parts of the 8254 section of "The Indispensable PC
> > Hardware Book", by Hans-Peter Messmer, Copyright 1994 Addison-Wesley,
> > although I probably ought to read it more carefully.
> 
> http://download.intel.com/design/archives/periphrl/docs/23124406.pdf
> should be the primary reference - as long as it leaves no open questions.

 Oh, I'm glad they've put it online after all, so there's an ultimate 
place to refer to.  I've only got a copy of this datasheet I got from 
Intel on a CD some 15 years ago.

 And for the record -- they used to publish the 8259A datasheet as well, 
but it appears to have gone from its place.  However it can be easily 
tracked down by an Internet search engine of your choice by referring to 
its order # as 231468.pdf (no revision number embedded there in the file 
name as there was none as it was originally published either).

  Maciej
diff mbox

Patch

diff --git a/arch/x86/kvm/i8254.c b/arch/x86/kvm/i8254.c
index adba28f..5cbba99 100644
--- a/arch/x86/kvm/i8254.c
+++ b/arch/x86/kvm/i8254.c
@@ -302,8 +302,12 @@  static void pit_do_work(struct kthread_work *work)
 	}
 	spin_unlock(&ps->inject_lock);
 	if (inject) {
-		kvm_set_irq(kvm, kvm->arch.vpit->irq_source_id, 0, 1);
+		/* Clear previous interrupt, then create a rising
+		 * edge to request another interupt, and leave it at
+		 * level=1 until time to inject another one.
+		 */
 		kvm_set_irq(kvm, kvm->arch.vpit->irq_source_id, 0, 0);
+		kvm_set_irq(kvm, kvm->arch.vpit->irq_source_id, 0, 1);
 
 		/*
 		 * Provides NMI watchdog support via Virtual Wire mode.
diff --git a/arch/x86/kvm/i8259.c b/arch/x86/kvm/i8259.c
index e498b18..b0c2346 100644
--- a/arch/x86/kvm/i8259.c
+++ b/arch/x86/kvm/i8259.c
@@ -111,8 +111,10 @@  static inline int pic_set_irq1(struct kvm_kpic_state *s, int irq, int level)
 				s->irr |= mask;
 			}
 			s->last_irr |= mask;
-		} else
+		} else {
+			s->irr &= ~mask;
 			s->last_irr &= ~mask;
+		}
 
 	return (s->imr & mask) ? -1 : ret;
 }
@@ -169,14 +171,10 @@  static void pic_update_irq(struct kvm_pic *s)
 {
 	int irq2, irq;
 
+	/* slave PIC notifies master PIC via IRQ2 */
 	irq2 = pic_get_irq(&s->pics[1]);
-	if (irq2 >= 0) {
-		/*
-		 * if irq request by slave pic, signal master PIC
-		 */
-		pic_set_irq1(&s->pics[0], 2, 1);
-		pic_set_irq1(&s->pics[0], 2, 0);
-	}
+	pic_set_irq1(&s->pics[0], 2, irq2 >= 0);
+
 	irq = pic_get_irq(&s->pics[0]);
 	pic_irq_request(s->kvm, irq >= 0);
 }