Patchwork [RFT,07/15] qemu_irq: Add IRQ handlers with delivery feedback

login
register
mail settings
Submitter Jan Kiszka
Date May 24, 2010, 8:13 p.m.
Message ID <e39866bcfe2ebc72f9a02c95d308271036b560f5.1274732025.git.jan.kiszka@web.de>
Download mbox | patch
Permalink /patch/53463/
State New
Headers show

Comments

Jan Kiszka - May 24, 2010, 8:13 p.m.
From: Jan Kiszka <jan.kiszka@siemens.com>

This allows to communicate potential IRQ coalescing during delivery from
the sink back to the source. Targets that support IRQ coalescing
workarounds need to register handlers that return the appropriate
QEMU_IRQ_* code, and they have to propergate the code across all IRQ
redirections. If the IRQ source receives a QEMU_IRQ_COALESCED, it can
apply its workaround. If multiple sinks exist, the source may only
consider an IRQ coalesced if all other sinks either report
QEMU_IRQ_COALESCED as well or QEMU_IRQ_MASKED.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
---
 hw/irq.c |   38 +++++++++++++++++++++++++++++---------
 hw/irq.h |   22 +++++++++++++++-------
 2 files changed, 44 insertions(+), 16 deletions(-)
Gleb Natapov - May 25, 2010, 6:07 a.m.
On Mon, May 24, 2010 at 10:13:40PM +0200, Jan Kiszka wrote:
> From: Jan Kiszka <jan.kiszka@siemens.com>
> 
> This allows to communicate potential IRQ coalescing during delivery from
> the sink back to the source. Targets that support IRQ coalescing
> workarounds need to register handlers that return the appropriate
> QEMU_IRQ_* code, and they have to propergate the code across all IRQ
> redirections. If the IRQ source receives a QEMU_IRQ_COALESCED, it can
> apply its workaround. If multiple sinks exist, the source may only
> consider an IRQ coalesced if all other sinks either report
> QEMU_IRQ_COALESCED as well or QEMU_IRQ_MASKED.
> 
Well, almost two years passed since this approach was proposed first
time[1] ;). Back then it generated bunch of nonsensical comments about
real hardware not working this way, so the hack that we have now was
introduce to overcome this resistance. I hope enough time passed for
people to gain some sense and the approach will be adopted this time.
Really this should have been done two year ago.

[1] http://lists.nongnu.org/archive/html/qemu-devel/2008-06/msg00757.html

--
			Gleb.
Jan Kiszka - May 25, 2010, 6:31 a.m.
Gleb Natapov wrote:
> On Mon, May 24, 2010 at 10:13:40PM +0200, Jan Kiszka wrote:
>> From: Jan Kiszka <jan.kiszka@siemens.com>
>>
>> This allows to communicate potential IRQ coalescing during delivery from
>> the sink back to the source. Targets that support IRQ coalescing
>> workarounds need to register handlers that return the appropriate
>> QEMU_IRQ_* code, and they have to propergate the code across all IRQ
>> redirections. If the IRQ source receives a QEMU_IRQ_COALESCED, it can
>> apply its workaround. If multiple sinks exist, the source may only
>> consider an IRQ coalesced if all other sinks either report
>> QEMU_IRQ_COALESCED as well or QEMU_IRQ_MASKED.
>>
> Well, almost two years passed since this approach was proposed first
> time[1] ;). Back then it generated bunch of nonsensical comments about
> real hardware not working this way, so the hack that we have now was
> introduce to overcome this resistance. I hope enough time passed for
> people to gain some sense and the approach will be adopted this time.
> Really this should have been done two year ago.
> 
> [1] http://lists.nongnu.org/archive/html/qemu-devel/2008-06/msg00757.html

Yeah, I somehow had a vague feeling that there must have been an earlier
attempt. I think my approach could be a slightly easier to accept as it
does not require converting all platforms, but this can happen on demand
(or not at all). Moreover, I think that the third return state,
QEMU_IRQ_MASKED, is important for correct handling of multiple IRQ sinks.

However, as I would see it now, we just have two options long term:
 - drop IRQ coalescing workarounds
 - properly support them via qemu_irq

The current hack cannot stay. E.g., it does not scale because it depends
on a global variable of the APIC. So we would never able to protect the
APICs with individual locks.

Jan
Gleb Natapov - May 25, 2010, 6:40 a.m.
On Tue, May 25, 2010 at 08:31:06AM +0200, Jan Kiszka wrote:
> Gleb Natapov wrote:
> > On Mon, May 24, 2010 at 10:13:40PM +0200, Jan Kiszka wrote:
> >> From: Jan Kiszka <jan.kiszka@siemens.com>
> >>
> >> This allows to communicate potential IRQ coalescing during delivery from
> >> the sink back to the source. Targets that support IRQ coalescing
> >> workarounds need to register handlers that return the appropriate
> >> QEMU_IRQ_* code, and they have to propergate the code across all IRQ
> >> redirections. If the IRQ source receives a QEMU_IRQ_COALESCED, it can
> >> apply its workaround. If multiple sinks exist, the source may only
> >> consider an IRQ coalesced if all other sinks either report
> >> QEMU_IRQ_COALESCED as well or QEMU_IRQ_MASKED.
> >>
> > Well, almost two years passed since this approach was proposed first
> > time[1] ;). Back then it generated bunch of nonsensical comments about
> > real hardware not working this way, so the hack that we have now was
> > introduce to overcome this resistance. I hope enough time passed for
> > people to gain some sense and the approach will be adopted this time.
> > Really this should have been done two year ago.
> > 
> > [1] http://lists.nongnu.org/archive/html/qemu-devel/2008-06/msg00757.html
> 
> Yeah, I somehow had a vague feeling that there must have been an earlier
> attempt. I think my approach could be a slightly easier to accept as it
> does not require converting all platforms, but this can happen on demand
> (or not at all).
I proposed that too at some point (to lazy to look for it in archives
and it was in words not patch). But since resistance to that approach
from the beginning was baseless no sane arguments or compromises would
help at that point.

>                   Moreover, I think that the third return state,
> QEMU_IRQ_MASKED, is important for correct handling of multiple IRQ sinks.
My patch had that too: <0 = coalesced, 0 = masked, >0=delivered

> 
> However, as I would see it now, we just have two options long term:
>  - drop IRQ coalescing workarounds
>  - properly support them via qemu_irq
> 
> The current hack cannot stay. E.g., it does not scale because it depends
> on a global variable of the APIC. So we would never able to protect the
> APICs with individual locks.
> 
Agree, and that was well understood at the time the hack was introduced.
But we can't just drop RTC IRQ reinjectoin. It would be crippling of qemu.
So we have only _one_ option:
 - properly support them via qemu_irq

--
			Gleb.
Jan Kiszka - May 25, 2010, 6:54 a.m.
Gleb Natapov wrote:
> On Tue, May 25, 2010 at 08:31:06AM +0200, Jan Kiszka wrote:
>> Gleb Natapov wrote:
>>> On Mon, May 24, 2010 at 10:13:40PM +0200, Jan Kiszka wrote:
>>>> From: Jan Kiszka <jan.kiszka@siemens.com>
>>>>
>>>> This allows to communicate potential IRQ coalescing during delivery from
>>>> the sink back to the source. Targets that support IRQ coalescing
>>>> workarounds need to register handlers that return the appropriate
>>>> QEMU_IRQ_* code, and they have to propergate the code across all IRQ
>>>> redirections. If the IRQ source receives a QEMU_IRQ_COALESCED, it can
>>>> apply its workaround. If multiple sinks exist, the source may only
>>>> consider an IRQ coalesced if all other sinks either report
>>>> QEMU_IRQ_COALESCED as well or QEMU_IRQ_MASKED.
>>>>
>>> Well, almost two years passed since this approach was proposed first
>>> time[1] ;). Back then it generated bunch of nonsensical comments about
>>> real hardware not working this way, so the hack that we have now was
>>> introduce to overcome this resistance. I hope enough time passed for
>>> people to gain some sense and the approach will be adopted this time.
>>> Really this should have been done two year ago.
>>>
>>> [1] http://lists.nongnu.org/archive/html/qemu-devel/2008-06/msg00757.html
>> Yeah, I somehow had a vague feeling that there must have been an earlier
>> attempt. I think my approach could be a slightly easier to accept as it
>> does not require converting all platforms, but this can happen on demand
>> (or not at all).
> I proposed that too at some point (to lazy to look for it in archives
> and it was in words not patch). But since resistance to that approach
> from the beginning was baseless no sane arguments or compromises would
> help at that point.
> 
>>                   Moreover, I think that the third return state,
>> QEMU_IRQ_MASKED, is important for correct handling of multiple IRQ sinks.
> My patch had that too: <0 = coalesced, 0 = masked, >0=delivered

Oh, indeed! Good to see that we came up with the same logic.

> 
>> However, as I would see it now, we just have two options long term:
>>  - drop IRQ coalescing workarounds
>>  - properly support them via qemu_irq
>>
>> The current hack cannot stay. E.g., it does not scale because it depends
>> on a global variable of the APIC. So we would never able to protect the
>> APICs with individual locks.
>>
> Agree, and that was well understood at the time the hack was introduced.
> But we can't just drop RTC IRQ reinjectoin. It would be crippling of qemu.
> So we have only _one_ option:
>  - properly support them via qemu_irq

You won't hear me disagreeing.

Jan
Blue Swirl - May 25, 2010, 7:09 p.m.
On Mon, May 24, 2010 at 8:13 PM, Jan Kiszka <jan.kiszka@web.de> wrote:
> From: Jan Kiszka <jan.kiszka@siemens.com>
>
> This allows to communicate potential IRQ coalescing during delivery from
> the sink back to the source. Targets that support IRQ coalescing
> workarounds need to register handlers that return the appropriate
> QEMU_IRQ_* code, and they have to propergate the code across all IRQ
> redirections. If the IRQ source receives a QEMU_IRQ_COALESCED, it can
> apply its workaround. If multiple sinks exist, the source may only
> consider an IRQ coalesced if all other sinks either report
> QEMU_IRQ_COALESCED as well or QEMU_IRQ_MASKED.

No real devices are interested whether any of their output lines are
even connected. This would introduce a new signal type, bidirectional
multi-level, which is not correct.

I think the real solution to coalescing is put the logic inside one
device, in this case APIC because it has the information about irq
delivery. APIC could monitor incoming RTC irqs for frequency
information and whether they get delivered or not. If not, an internal
timer is installed which injects the lost irqs.

Of course, no real device could do such de-coalescing, but with this
approach, the voodoo is contained to insides of one device, APIC.

We should also take a step back to think what was the cause of lost
irqs, IIRC uneven execution rate in QEMU. Could this be fixed or taken
into account in timer handling? For example, CPU loop could analyze
the wall clock time between CPU exits and use that to offset the
timers. Thus the timer frequency (in wall clock time) could be made to
correspond a bit more to VCPU execution rate.

>
> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
> ---
>  hw/irq.c |   38 +++++++++++++++++++++++++++++---------
>  hw/irq.h |   22 +++++++++++++++-------
>  2 files changed, 44 insertions(+), 16 deletions(-)
>
> diff --git a/hw/irq.c b/hw/irq.c
> index 7703f62..db2cce6 100644
> --- a/hw/irq.c
> +++ b/hw/irq.c
> @@ -26,19 +26,27 @@
>
>  struct IRQState {
>     qemu_irq_handler handler;
> +    qemu_irq_fb_handler feedback_handler;
>     void *opaque;
>     int n;
>  };
>
> -void qemu_set_irq(qemu_irq irq, int level)
> +int qemu_set_irq(qemu_irq irq, int level)
>  {
> -    if (!irq)
> -        return;
> -
> -    irq->handler(irq->opaque, irq->n, level);
> +    if (!irq) {
> +        return 0;
> +    }
> +    if (irq->feedback_handler) {
> +        return irq->feedback_handler(irq->opaque, irq->n, level);
> +    } else {
> +        irq->handler(irq->opaque, irq->n, level);
> +        return QEMU_IRQ_DELIVERED;
> +    }
>  }
>
> -qemu_irq *qemu_allocate_irqs(qemu_irq_handler handler, void *opaque, int n)
> +static qemu_irq *allocate_irqs(qemu_irq_handler handler,
> +                               qemu_irq_fb_handler feedback_handler,
> +                               void *opaque, int n)
>  {
>     qemu_irq *s;
>     struct IRQState *p;
> @@ -48,6 +56,7 @@ qemu_irq *qemu_allocate_irqs(qemu_irq_handler handler, void *opaque, int n)
>     p = (struct IRQState *)qemu_mallocz(sizeof(struct IRQState) * n);
>     for (i = 0; i < n; i++) {
>         p->handler = handler;
> +        p->feedback_handler = feedback_handler;
>         p->opaque = opaque;
>         p->n = i;
>         s[i] = p;
> @@ -56,22 +65,33 @@ qemu_irq *qemu_allocate_irqs(qemu_irq_handler handler, void *opaque, int n)
>     return s;
>  }
>
> +qemu_irq *qemu_allocate_irqs(qemu_irq_handler handler, void *opaque, int n)
> +{
> +    return allocate_irqs(handler, NULL, opaque, n);
> +}
> +
> +qemu_irq *qemu_allocate_feedback_irqs(qemu_irq_fb_handler handler,
> +                                      void *opaque, int n)
> +{
> +    return allocate_irqs(NULL, handler, opaque, n);
> +}
> +
>  void qemu_free_irqs(qemu_irq *s)
>  {
>     qemu_free(s[0]);
>     qemu_free(s);
>  }
>
> -static void qemu_notirq(void *opaque, int line, int level)
> +static int qemu_notirq(void *opaque, int line, int level)
>  {
>     struct IRQState *irq = opaque;
>
> -    irq->handler(irq->opaque, irq->n, !level);
> +    return qemu_set_irq(irq, !level);
>  }
>
>  qemu_irq qemu_irq_invert(qemu_irq irq)
>  {
>     /* The default state for IRQs is low, so raise the output now.  */
>     qemu_irq_raise(irq);
> -    return qemu_allocate_irqs(qemu_notirq, irq, 1)[0];
> +    return allocate_irqs(NULL, qemu_notirq, irq, 1)[0];
>  }
> diff --git a/hw/irq.h b/hw/irq.h
> index 5daae44..eee03e6 100644
> --- a/hw/irq.h
> +++ b/hw/irq.h
> @@ -3,15 +3,18 @@
>
>  /* Generic IRQ/GPIO pin infrastructure.  */
>
> -/* FIXME: Rmove one of these.  */
> +#define QEMU_IRQ_DELIVERED      0
> +#define QEMU_IRQ_COALESCED      (-1)
> +#define QEMU_IRQ_MASKED         (-2)
> +
>  typedef void (*qemu_irq_handler)(void *opaque, int n, int level);
> -typedef void SetIRQFunc(void *opaque, int irq_num, int level);
> +typedef int (*qemu_irq_fb_handler)(void *opaque, int n, int level);
>
> -void qemu_set_irq(qemu_irq irq, int level);
> +int qemu_set_irq(qemu_irq irq, int level);
>
> -static inline void qemu_irq_raise(qemu_irq irq)
> +static inline int qemu_irq_raise(qemu_irq irq)
>  {
> -    qemu_set_irq(irq, 1);
> +    return qemu_set_irq(irq, 1);
>  }
>
>  static inline void qemu_irq_lower(qemu_irq irq)
> @@ -19,14 +22,19 @@ static inline void qemu_irq_lower(qemu_irq irq)
>     qemu_set_irq(irq, 0);
>  }
>
> -static inline void qemu_irq_pulse(qemu_irq irq)
> +static inline int qemu_irq_pulse(qemu_irq irq)
>  {
> -    qemu_set_irq(irq, 1);
> +    int ret;
> +
> +    ret = qemu_set_irq(irq, 1);
>     qemu_set_irq(irq, 0);
> +    return ret;
>  }
>
>  /* Returns an array of N IRQs.  */
>  qemu_irq *qemu_allocate_irqs(qemu_irq_handler handler, void *opaque, int n);
> +qemu_irq *qemu_allocate_feedback_irqs(qemu_irq_fb_handler handler,
> +                                      void *opaque, int n);
>  void qemu_free_irqs(qemu_irq *s);
>
>  /* Returns a new IRQ with opposite polarity.  */
> --
> 1.6.0.2
>
>
Anthony Liguori - May 25, 2010, 8:16 p.m.
On 05/25/2010 02:09 PM, Blue Swirl wrote:
> On Mon, May 24, 2010 at 8:13 PM, Jan Kiszka<jan.kiszka@web.de>  wrote:
>    
>> From: Jan Kiszka<jan.kiszka@siemens.com>
>>
>> This allows to communicate potential IRQ coalescing during delivery from
>> the sink back to the source. Targets that support IRQ coalescing
>> workarounds need to register handlers that return the appropriate
>> QEMU_IRQ_* code, and they have to propergate the code across all IRQ
>> redirections. If the IRQ source receives a QEMU_IRQ_COALESCED, it can
>> apply its workaround. If multiple sinks exist, the source may only
>> consider an IRQ coalesced if all other sinks either report
>> QEMU_IRQ_COALESCED as well or QEMU_IRQ_MASKED.
>>      
> No real devices are interested whether any of their output lines are
> even connected. This would introduce a new signal type, bidirectional
> multi-level, which is not correct.
>    

I don't think it's really an issue of correct, but I wouldn't disagree 
to a suggestion that we ought to introduce a new signal type for this 
type of bidirectional feedback.  Maybe it's qemu_coalesced_irq and has a 
similar interface as qemu_irq.

> I think the real solution to coalescing is put the logic inside one
> device, in this case APIC because it has the information about irq
> delivery. APIC could monitor incoming RTC irqs for frequency
> information and whether they get delivered or not. If not, an internal
> timer is installed which injects the lost irqs.
>
> Of course, no real device could do such de-coalescing, but with this
> approach, the voodoo is contained to insides of one device, APIC.
>
> We should also take a step back to think what was the cause of lost
> irqs, IIRC uneven execution rate in QEMU.

Not only that.  The pathological case is where a host is limited to a 
1khz timer frequency and the guest requests a 1khz timer frequency.  
Practically speaking, there is no way we'll ever be able to adjust 
timers to reinject lost interrupts because of the host timer limitation.

>   Could this be fixed or taken
> into account in timer handling? For example, CPU loop could analyze
> the wall clock time between CPU exits and use that to offset the
> timers. Thus the timer frequency (in wall clock time) could be made to
> correspond a bit more to VCPU execution rate.
>    

A lot of what motivates the timer reinjection work is very old linux 
kernels that had fixed userspace timer frequencies.  On newer host 
kernels, it's probably not nearly as important except when you get into 
pathological cases like exposing a high frequency HPET timer to the 
guest where you cannot keep up with the host.

Regards,

Anthony Liguori

>> Signed-off-by: Jan Kiszka<jan.kiszka@siemens.com>
>> ---
>>   hw/irq.c |   38 +++++++++++++++++++++++++++++---------
>>   hw/irq.h |   22 +++++++++++++++-------
>>   2 files changed, 44 insertions(+), 16 deletions(-)
>>
>> diff --git a/hw/irq.c b/hw/irq.c
>> index 7703f62..db2cce6 100644
>> --- a/hw/irq.c
>> +++ b/hw/irq.c
>> @@ -26,19 +26,27 @@
>>
>>   struct IRQState {
>>      qemu_irq_handler handler;
>> +    qemu_irq_fb_handler feedback_handler;
>>      void *opaque;
>>      int n;
>>   };
>>
>> -void qemu_set_irq(qemu_irq irq, int level)
>> +int qemu_set_irq(qemu_irq irq, int level)
>>   {
>> -    if (!irq)
>> -        return;
>> -
>> -    irq->handler(irq->opaque, irq->n, level);
>> +    if (!irq) {
>> +        return 0;
>> +    }
>> +    if (irq->feedback_handler) {
>> +        return irq->feedback_handler(irq->opaque, irq->n, level);
>> +    } else {
>> +        irq->handler(irq->opaque, irq->n, level);
>> +        return QEMU_IRQ_DELIVERED;
>> +    }
>>   }
>>
>> -qemu_irq *qemu_allocate_irqs(qemu_irq_handler handler, void *opaque, int n)
>> +static qemu_irq *allocate_irqs(qemu_irq_handler handler,
>> +                               qemu_irq_fb_handler feedback_handler,
>> +                               void *opaque, int n)
>>   {
>>      qemu_irq *s;
>>      struct IRQState *p;
>> @@ -48,6 +56,7 @@ qemu_irq *qemu_allocate_irqs(qemu_irq_handler handler, void *opaque, int n)
>>      p = (struct IRQState *)qemu_mallocz(sizeof(struct IRQState) * n);
>>      for (i = 0; i<  n; i++) {
>>          p->handler = handler;
>> +        p->feedback_handler = feedback_handler;
>>          p->opaque = opaque;
>>          p->n = i;
>>          s[i] = p;
>> @@ -56,22 +65,33 @@ qemu_irq *qemu_allocate_irqs(qemu_irq_handler handler, void *opaque, int n)
>>      return s;
>>   }
>>
>> +qemu_irq *qemu_allocate_irqs(qemu_irq_handler handler, void *opaque, int n)
>> +{
>> +    return allocate_irqs(handler, NULL, opaque, n);
>> +}
>> +
>> +qemu_irq *qemu_allocate_feedback_irqs(qemu_irq_fb_handler handler,
>> +                                      void *opaque, int n)
>> +{
>> +    return allocate_irqs(NULL, handler, opaque, n);
>> +}
>> +
>>   void qemu_free_irqs(qemu_irq *s)
>>   {
>>      qemu_free(s[0]);
>>      qemu_free(s);
>>   }
>>
>> -static void qemu_notirq(void *opaque, int line, int level)
>> +static int qemu_notirq(void *opaque, int line, int level)
>>   {
>>      struct IRQState *irq = opaque;
>>
>> -    irq->handler(irq->opaque, irq->n, !level);
>> +    return qemu_set_irq(irq, !level);
>>   }
>>
>>   qemu_irq qemu_irq_invert(qemu_irq irq)
>>   {
>>      /* The default state for IRQs is low, so raise the output now.  */
>>      qemu_irq_raise(irq);
>> -    return qemu_allocate_irqs(qemu_notirq, irq, 1)[0];
>> +    return allocate_irqs(NULL, qemu_notirq, irq, 1)[0];
>>   }
>> diff --git a/hw/irq.h b/hw/irq.h
>> index 5daae44..eee03e6 100644
>> --- a/hw/irq.h
>> +++ b/hw/irq.h
>> @@ -3,15 +3,18 @@
>>
>>   /* Generic IRQ/GPIO pin infrastructure.  */
>>
>> -/* FIXME: Rmove one of these.  */
>> +#define QEMU_IRQ_DELIVERED      0
>> +#define QEMU_IRQ_COALESCED      (-1)
>> +#define QEMU_IRQ_MASKED         (-2)
>> +
>>   typedef void (*qemu_irq_handler)(void *opaque, int n, int level);
>> -typedef void SetIRQFunc(void *opaque, int irq_num, int level);
>> +typedef int (*qemu_irq_fb_handler)(void *opaque, int n, int level);
>>
>> -void qemu_set_irq(qemu_irq irq, int level);
>> +int qemu_set_irq(qemu_irq irq, int level);
>>
>> -static inline void qemu_irq_raise(qemu_irq irq)
>> +static inline int qemu_irq_raise(qemu_irq irq)
>>   {
>> -    qemu_set_irq(irq, 1);
>> +    return qemu_set_irq(irq, 1);
>>   }
>>
>>   static inline void qemu_irq_lower(qemu_irq irq)
>> @@ -19,14 +22,19 @@ static inline void qemu_irq_lower(qemu_irq irq)
>>      qemu_set_irq(irq, 0);
>>   }
>>
>> -static inline void qemu_irq_pulse(qemu_irq irq)
>> +static inline int qemu_irq_pulse(qemu_irq irq)
>>   {
>> -    qemu_set_irq(irq, 1);
>> +    int ret;
>> +
>> +    ret = qemu_set_irq(irq, 1);
>>      qemu_set_irq(irq, 0);
>> +    return ret;
>>   }
>>
>>   /* Returns an array of N IRQs.  */
>>   qemu_irq *qemu_allocate_irqs(qemu_irq_handler handler, void *opaque, int n);
>> +qemu_irq *qemu_allocate_feedback_irqs(qemu_irq_fb_handler handler,
>> +                                      void *opaque, int n);
>>   void qemu_free_irqs(qemu_irq *s);
>>
>>   /* Returns a new IRQ with opposite polarity.  */
>> --
>> 1.6.0.2
>>
>>
>>      
>
Jan Kiszka - May 25, 2010, 9:44 p.m.
Anthony Liguori wrote:
> On 05/25/2010 02:09 PM, Blue Swirl wrote:
>> On Mon, May 24, 2010 at 8:13 PM, Jan Kiszka<jan.kiszka@web.de>  wrote:
>>   
>>> From: Jan Kiszka<jan.kiszka@siemens.com>
>>>
>>> This allows to communicate potential IRQ coalescing during delivery from
>>> the sink back to the source. Targets that support IRQ coalescing
>>> workarounds need to register handlers that return the appropriate
>>> QEMU_IRQ_* code, and they have to propergate the code across all IRQ
>>> redirections. If the IRQ source receives a QEMU_IRQ_COALESCED, it can
>>> apply its workaround. If multiple sinks exist, the source may only
>>> consider an IRQ coalesced if all other sinks either report
>>> QEMU_IRQ_COALESCED as well or QEMU_IRQ_MASKED.
>>>      
>> No real devices are interested whether any of their output lines are
>> even connected. This would introduce a new signal type, bidirectional
>> multi-level, which is not correct.
>>    
> 
> I don't think it's really an issue of correct, but I wouldn't disagree
> to a suggestion that we ought to introduce a new signal type for this
> type of bidirectional feedback.  Maybe it's qemu_coalesced_irq and has a
> similar interface as qemu_irq.

A separate type would complicate the delivery of the feedback value
across GPIO pins (as Paul requested for the RTC->HPET routing).

> 
>> I think the real solution to coalescing is put the logic inside one
>> device, in this case APIC because it has the information about irq
>> delivery. APIC could monitor incoming RTC irqs for frequency
>> information and whether they get delivered or not. If not, an internal
>> timer is installed which injects the lost irqs.

That won't fly as the IRQs will already arrive at the APIC with a
sufficiently high jitter. At the bare minimum, you need to tell the
interrupt controller about the fact that a particular IRQ should be
delivered at a specific regular rate. For this, you also need a generic
interface - nothing really "won".

Jan
Gleb Natapov - May 26, 2010, 8:08 a.m.
On Tue, May 25, 2010 at 11:44:50PM +0200, Jan Kiszka wrote:
> > 
> >> I think the real solution to coalescing is put the logic inside one
> >> device, in this case APIC because it has the information about irq
> >> delivery. APIC could monitor incoming RTC irqs for frequency
> >> information and whether they get delivered or not. If not, an internal
> >> timer is installed which injects the lost irqs.
> 
> That won't fly as the IRQs will already arrive at the APIC with a
> sufficiently high jitter. At the bare minimum, you need to tell the
> interrupt controller about the fact that a particular IRQ should be
> delivered at a specific regular rate. For this, you also need a generic
> interface - nothing really "won".
> 
Not only that. Suppose RTC runs with 64Hz frequency and N interrupts
were coalesced during some period. Now the guest reprograms RTC to
1024Hz. N should be scaled accordingly otherwise reinjection will not
fix the drift. Do you propose to put this logic into APIC to?

--
			Gleb.
Blue Swirl - May 26, 2010, 7:49 p.m.
On Tue, May 25, 2010 at 8:16 PM, Anthony Liguori <anthony@codemonkey.ws> wrote:
> On 05/25/2010 02:09 PM, Blue Swirl wrote:
>>
>> On Mon, May 24, 2010 at 8:13 PM, Jan Kiszka<jan.kiszka@web.de>  wrote:
>>
>>>
>>> From: Jan Kiszka<jan.kiszka@siemens.com>
>>>
>>> This allows to communicate potential IRQ coalescing during delivery from
>>> the sink back to the source. Targets that support IRQ coalescing
>>> workarounds need to register handlers that return the appropriate
>>> QEMU_IRQ_* code, and they have to propergate the code across all IRQ
>>> redirections. If the IRQ source receives a QEMU_IRQ_COALESCED, it can
>>> apply its workaround. If multiple sinks exist, the source may only
>>> consider an IRQ coalesced if all other sinks either report
>>> QEMU_IRQ_COALESCED as well or QEMU_IRQ_MASKED.
>>>
>>
>> No real devices are interested whether any of their output lines are
>> even connected. This would introduce a new signal type, bidirectional
>> multi-level, which is not correct.
>>
>
> I don't think it's really an issue of correct, but I wouldn't disagree to a
> suggestion that we ought to introduce a new signal type for this type of
> bidirectional feedback.  Maybe it's qemu_coalesced_irq and has a similar
> interface as qemu_irq.

I'd rather try harder to find a solution for the problem without this.

>> I think the real solution to coalescing is put the logic inside one
>> device, in this case APIC because it has the information about irq
>> delivery. APIC could monitor incoming RTC irqs for frequency
>> information and whether they get delivered or not. If not, an internal
>> timer is installed which injects the lost irqs.
>>
>> Of course, no real device could do such de-coalescing, but with this
>> approach, the voodoo is contained to insides of one device, APIC.
>>
>> We should also take a step back to think what was the cause of lost
>> irqs, IIRC uneven execution rate in QEMU.
>
> Not only that.  The pathological case is where a host is limited to a 1khz
> timer frequency and the guest requests a 1khz timer frequency.  Practically
> speaking, there is no way we'll ever be able to adjust timers to reinject
> lost interrupts because of the host timer limitation.

But APIC knows when the guest has acked a timer irq and so can
immediately issue another one to cover a lost irq. APIC could also
arrange a CPU exit. The irqs would not arrive at even rate, but in
this pathological case, the host can't ever schedule any delays less
than 1ms.

>>  Could this be fixed or taken
>> into account in timer handling? For example, CPU loop could analyze
>> the wall clock time between CPU exits and use that to offset the
>> timers. Thus the timer frequency (in wall clock time) could be made to
>> correspond a bit more to VCPU execution rate.
>>
>
> A lot of what motivates the timer reinjection work is very old linux kernels
> that had fixed userspace timer frequencies.  On newer host kernels, it's
> probably not nearly as important except when you get into pathological cases
> like exposing a high frequency HPET timer to the guest where you cannot keep
> up with the host.

There will always be some delays and overhead because of the
emulation. We can either hide this completely from the guest (by
stretching the guest time) or we can try to keep the guest time base
in synch with host but then the guest can't expect native class
performance to be available between timer irqs.

On a native hardware, you could have a 1kHz timer and a kernel task
always running for max. 999us with interrupts disabled, then for at
least 1us the interrupts are enabled and the timer interrupt handler
gets a chance to run. On QEMU, such a small time window may be
impossible to emulate. No coalescing would help in this case. With a
cycle counting emulation it's possible, but the speed would probably
be slower than native. With the interrupt coalescing you may be able
to cover some easy cases.

>
> Regards,
>
> Anthony Liguori
>
>>> Signed-off-by: Jan Kiszka<jan.kiszka@siemens.com>
>>> ---
>>>  hw/irq.c |   38 +++++++++++++++++++++++++++++---------
>>>  hw/irq.h |   22 +++++++++++++++-------
>>>  2 files changed, 44 insertions(+), 16 deletions(-)
>>>
>>> diff --git a/hw/irq.c b/hw/irq.c
>>> index 7703f62..db2cce6 100644
>>> --- a/hw/irq.c
>>> +++ b/hw/irq.c
>>> @@ -26,19 +26,27 @@
>>>
>>>  struct IRQState {
>>>     qemu_irq_handler handler;
>>> +    qemu_irq_fb_handler feedback_handler;
>>>     void *opaque;
>>>     int n;
>>>  };
>>>
>>> -void qemu_set_irq(qemu_irq irq, int level)
>>> +int qemu_set_irq(qemu_irq irq, int level)
>>>  {
>>> -    if (!irq)
>>> -        return;
>>> -
>>> -    irq->handler(irq->opaque, irq->n, level);
>>> +    if (!irq) {
>>> +        return 0;
>>> +    }
>>> +    if (irq->feedback_handler) {
>>> +        return irq->feedback_handler(irq->opaque, irq->n, level);
>>> +    } else {
>>> +        irq->handler(irq->opaque, irq->n, level);
>>> +        return QEMU_IRQ_DELIVERED;
>>> +    }
>>>  }
>>>
>>> -qemu_irq *qemu_allocate_irqs(qemu_irq_handler handler, void *opaque, int
>>> n)
>>> +static qemu_irq *allocate_irqs(qemu_irq_handler handler,
>>> +                               qemu_irq_fb_handler feedback_handler,
>>> +                               void *opaque, int n)
>>>  {
>>>     qemu_irq *s;
>>>     struct IRQState *p;
>>> @@ -48,6 +56,7 @@ qemu_irq *qemu_allocate_irqs(qemu_irq_handler handler,
>>> void *opaque, int n)
>>>     p = (struct IRQState *)qemu_mallocz(sizeof(struct IRQState) * n);
>>>     for (i = 0; i<  n; i++) {
>>>         p->handler = handler;
>>> +        p->feedback_handler = feedback_handler;
>>>         p->opaque = opaque;
>>>         p->n = i;
>>>         s[i] = p;
>>> @@ -56,22 +65,33 @@ qemu_irq *qemu_allocate_irqs(qemu_irq_handler
>>> handler, void *opaque, int n)
>>>     return s;
>>>  }
>>>
>>> +qemu_irq *qemu_allocate_irqs(qemu_irq_handler handler, void *opaque, int
>>> n)
>>> +{
>>> +    return allocate_irqs(handler, NULL, opaque, n);
>>> +}
>>> +
>>> +qemu_irq *qemu_allocate_feedback_irqs(qemu_irq_fb_handler handler,
>>> +                                      void *opaque, int n)
>>> +{
>>> +    return allocate_irqs(NULL, handler, opaque, n);
>>> +}
>>> +
>>>  void qemu_free_irqs(qemu_irq *s)
>>>  {
>>>     qemu_free(s[0]);
>>>     qemu_free(s);
>>>  }
>>>
>>> -static void qemu_notirq(void *opaque, int line, int level)
>>> +static int qemu_notirq(void *opaque, int line, int level)
>>>  {
>>>     struct IRQState *irq = opaque;
>>>
>>> -    irq->handler(irq->opaque, irq->n, !level);
>>> +    return qemu_set_irq(irq, !level);
>>>  }
>>>
>>>  qemu_irq qemu_irq_invert(qemu_irq irq)
>>>  {
>>>     /* The default state for IRQs is low, so raise the output now.  */
>>>     qemu_irq_raise(irq);
>>> -    return qemu_allocate_irqs(qemu_notirq, irq, 1)[0];
>>> +    return allocate_irqs(NULL, qemu_notirq, irq, 1)[0];
>>>  }
>>> diff --git a/hw/irq.h b/hw/irq.h
>>> index 5daae44..eee03e6 100644
>>> --- a/hw/irq.h
>>> +++ b/hw/irq.h
>>> @@ -3,15 +3,18 @@
>>>
>>>  /* Generic IRQ/GPIO pin infrastructure.  */
>>>
>>> -/* FIXME: Rmove one of these.  */
>>> +#define QEMU_IRQ_DELIVERED      0
>>> +#define QEMU_IRQ_COALESCED      (-1)
>>> +#define QEMU_IRQ_MASKED         (-2)
>>> +
>>>  typedef void (*qemu_irq_handler)(void *opaque, int n, int level);
>>> -typedef void SetIRQFunc(void *opaque, int irq_num, int level);
>>> +typedef int (*qemu_irq_fb_handler)(void *opaque, int n, int level);
>>>
>>> -void qemu_set_irq(qemu_irq irq, int level);
>>> +int qemu_set_irq(qemu_irq irq, int level);
>>>
>>> -static inline void qemu_irq_raise(qemu_irq irq)
>>> +static inline int qemu_irq_raise(qemu_irq irq)
>>>  {
>>> -    qemu_set_irq(irq, 1);
>>> +    return qemu_set_irq(irq, 1);
>>>  }
>>>
>>>  static inline void qemu_irq_lower(qemu_irq irq)
>>> @@ -19,14 +22,19 @@ static inline void qemu_irq_lower(qemu_irq irq)
>>>     qemu_set_irq(irq, 0);
>>>  }
>>>
>>> -static inline void qemu_irq_pulse(qemu_irq irq)
>>> +static inline int qemu_irq_pulse(qemu_irq irq)
>>>  {
>>> -    qemu_set_irq(irq, 1);
>>> +    int ret;
>>> +
>>> +    ret = qemu_set_irq(irq, 1);
>>>     qemu_set_irq(irq, 0);
>>> +    return ret;
>>>  }
>>>
>>>  /* Returns an array of N IRQs.  */
>>>  qemu_irq *qemu_allocate_irqs(qemu_irq_handler handler, void *opaque, int
>>> n);
>>> +qemu_irq *qemu_allocate_feedback_irqs(qemu_irq_fb_handler handler,
>>> +                                      void *opaque, int n);
>>>  void qemu_free_irqs(qemu_irq *s);
>>>
>>>  /* Returns a new IRQ with opposite polarity.  */
>>> --
>>> 1.6.0.2
>>>
>>>
>>>
>>
>>
>
>
Blue Swirl - May 26, 2010, 7:55 p.m.
On Tue, May 25, 2010 at 9:44 PM, Jan Kiszka <jan.kiszka@web.de> wrote:
> Anthony Liguori wrote:
>> On 05/25/2010 02:09 PM, Blue Swirl wrote:
>>> On Mon, May 24, 2010 at 8:13 PM, Jan Kiszka<jan.kiszka@web.de>  wrote:
>>>
>>>> From: Jan Kiszka<jan.kiszka@siemens.com>
>>>>
>>>> This allows to communicate potential IRQ coalescing during delivery from
>>>> the sink back to the source. Targets that support IRQ coalescing
>>>> workarounds need to register handlers that return the appropriate
>>>> QEMU_IRQ_* code, and they have to propergate the code across all IRQ
>>>> redirections. If the IRQ source receives a QEMU_IRQ_COALESCED, it can
>>>> apply its workaround. If multiple sinks exist, the source may only
>>>> consider an IRQ coalesced if all other sinks either report
>>>> QEMU_IRQ_COALESCED as well or QEMU_IRQ_MASKED.
>>>>
>>> No real devices are interested whether any of their output lines are
>>> even connected. This would introduce a new signal type, bidirectional
>>> multi-level, which is not correct.
>>>
>>
>> I don't think it's really an issue of correct, but I wouldn't disagree
>> to a suggestion that we ought to introduce a new signal type for this
>> type of bidirectional feedback.  Maybe it's qemu_coalesced_irq and has a
>> similar interface as qemu_irq.
>
> A separate type would complicate the delivery of the feedback value
> across GPIO pins (as Paul requested for the RTC->HPET routing).
>
>>
>>> I think the real solution to coalescing is put the logic inside one
>>> device, in this case APIC because it has the information about irq
>>> delivery. APIC could monitor incoming RTC irqs for frequency
>>> information and whether they get delivered or not. If not, an internal
>>> timer is installed which injects the lost irqs.
>
> That won't fly as the IRQs will already arrive at the APIC with a
> sufficiently high jitter. At the bare minimum, you need to tell the
> interrupt controller about the fact that a particular IRQ should be
> delivered at a specific regular rate. For this, you also need a generic
> interface - nothing really "won".

OK, let's simplify: just reinject at next possible chance. No need to
monitor or tell anything.
Jan Kiszka - May 26, 2010, 8:09 p.m.
Blue Swirl wrote:
> On Tue, May 25, 2010 at 9:44 PM, Jan Kiszka <jan.kiszka@web.de> wrote:
>> Anthony Liguori wrote:
>>> On 05/25/2010 02:09 PM, Blue Swirl wrote:
>>>> On Mon, May 24, 2010 at 8:13 PM, Jan Kiszka<jan.kiszka@web.de>  wrote:
>>>>
>>>>> From: Jan Kiszka<jan.kiszka@siemens.com>
>>>>>
>>>>> This allows to communicate potential IRQ coalescing during delivery from
>>>>> the sink back to the source. Targets that support IRQ coalescing
>>>>> workarounds need to register handlers that return the appropriate
>>>>> QEMU_IRQ_* code, and they have to propergate the code across all IRQ
>>>>> redirections. If the IRQ source receives a QEMU_IRQ_COALESCED, it can
>>>>> apply its workaround. If multiple sinks exist, the source may only
>>>>> consider an IRQ coalesced if all other sinks either report
>>>>> QEMU_IRQ_COALESCED as well or QEMU_IRQ_MASKED.
>>>>>
>>>> No real devices are interested whether any of their output lines are
>>>> even connected. This would introduce a new signal type, bidirectional
>>>> multi-level, which is not correct.
>>>>
>>> I don't think it's really an issue of correct, but I wouldn't disagree
>>> to a suggestion that we ought to introduce a new signal type for this
>>> type of bidirectional feedback.  Maybe it's qemu_coalesced_irq and has a
>>> similar interface as qemu_irq.
>> A separate type would complicate the delivery of the feedback value
>> across GPIO pins (as Paul requested for the RTC->HPET routing).
>>
>>>> I think the real solution to coalescing is put the logic inside one
>>>> device, in this case APIC because it has the information about irq
>>>> delivery. APIC could monitor incoming RTC irqs for frequency
>>>> information and whether they get delivered or not. If not, an internal
>>>> timer is installed which injects the lost irqs.
>> That won't fly as the IRQs will already arrive at the APIC with a
>> sufficiently high jitter. At the bare minimum, you need to tell the
>> interrupt controller about the fact that a particular IRQ should be
>> delivered at a specific regular rate. For this, you also need a generic
>> interface - nothing really "won".
> 
> OK, let's simplify: just reinject at next possible chance. No need to
> monitor or tell anything.

There are guests that won't like this (I know of one in-house, but
others may even have more examples), specifically if you end up firing
multiple IRQs in a row due to a longer backlog. For that reason, the RTC
spreads the reinjection according to the current rate.

And even if the rate did not matter, the APIC woult still have to now
about the fact that an IRQ is really periodic and does not only appear
as such for a certain interval. This really does not sound like
simplifying things or even make them cleaner.

Jan
Blue Swirl - May 26, 2010, 8:14 p.m.
On Wed, May 26, 2010 at 8:08 AM, Gleb Natapov <gleb@redhat.com> wrote:
> On Tue, May 25, 2010 at 11:44:50PM +0200, Jan Kiszka wrote:
>> >
>> >> I think the real solution to coalescing is put the logic inside one
>> >> device, in this case APIC because it has the information about irq
>> >> delivery. APIC could monitor incoming RTC irqs for frequency
>> >> information and whether they get delivered or not. If not, an internal
>> >> timer is installed which injects the lost irqs.
>>
>> That won't fly as the IRQs will already arrive at the APIC with a
>> sufficiently high jitter. At the bare minimum, you need to tell the
>> interrupt controller about the fact that a particular IRQ should be
>> delivered at a specific regular rate. For this, you also need a generic
>> interface - nothing really "won".
>>
> Not only that. Suppose RTC runs with 64Hz frequency and N interrupts
> were coalesced during some period. Now the guest reprograms RTC to
> 1024Hz. N should be scaled accordingly otherwise reinjection will not
> fix the drift. Do you propose to put this logic into APIC to?

Interesting case, I don't think this is handled by the current code
either. Could this happen if the target machine does not have HPET and
the guest runs a tickless kernel?

I think the guest would be buggy to reprogram the timer without
checking that the interrupt from the previous timer frequency won't
interfere, for example by stopping the clock, or doing the
reprogramming only at timer interrupt handler. Otherwise the period
may be unpredictable for one period, which means that the time base
has been lost.

But let's consider how this could be handled with the current code (or
with the magical interrupts). When doing the scaling, the guest, while
reprogramming, is unaware of the old queued interrupts with the
previous rate. Why would scaling these be more correct? I'd think the
old ones should be just reinjected ASAP without any scaling.
Blue Swirl - May 26, 2010, 8:35 p.m.
On Wed, May 26, 2010 at 8:09 PM, Jan Kiszka <jan.kiszka@web.de> wrote:
> Blue Swirl wrote:
>> On Tue, May 25, 2010 at 9:44 PM, Jan Kiszka <jan.kiszka@web.de> wrote:
>>> Anthony Liguori wrote:
>>>> On 05/25/2010 02:09 PM, Blue Swirl wrote:
>>>>> On Mon, May 24, 2010 at 8:13 PM, Jan Kiszka<jan.kiszka@web.de>  wrote:
>>>>>
>>>>>> From: Jan Kiszka<jan.kiszka@siemens.com>
>>>>>>
>>>>>> This allows to communicate potential IRQ coalescing during delivery from
>>>>>> the sink back to the source. Targets that support IRQ coalescing
>>>>>> workarounds need to register handlers that return the appropriate
>>>>>> QEMU_IRQ_* code, and they have to propergate the code across all IRQ
>>>>>> redirections. If the IRQ source receives a QEMU_IRQ_COALESCED, it can
>>>>>> apply its workaround. If multiple sinks exist, the source may only
>>>>>> consider an IRQ coalesced if all other sinks either report
>>>>>> QEMU_IRQ_COALESCED as well or QEMU_IRQ_MASKED.
>>>>>>
>>>>> No real devices are interested whether any of their output lines are
>>>>> even connected. This would introduce a new signal type, bidirectional
>>>>> multi-level, which is not correct.
>>>>>
>>>> I don't think it's really an issue of correct, but I wouldn't disagree
>>>> to a suggestion that we ought to introduce a new signal type for this
>>>> type of bidirectional feedback.  Maybe it's qemu_coalesced_irq and has a
>>>> similar interface as qemu_irq.
>>> A separate type would complicate the delivery of the feedback value
>>> across GPIO pins (as Paul requested for the RTC->HPET routing).
>>>
>>>>> I think the real solution to coalescing is put the logic inside one
>>>>> device, in this case APIC because it has the information about irq
>>>>> delivery. APIC could monitor incoming RTC irqs for frequency
>>>>> information and whether they get delivered or not. If not, an internal
>>>>> timer is installed which injects the lost irqs.
>>> That won't fly as the IRQs will already arrive at the APIC with a
>>> sufficiently high jitter. At the bare minimum, you need to tell the
>>> interrupt controller about the fact that a particular IRQ should be
>>> delivered at a specific regular rate. For this, you also need a generic
>>> interface - nothing really "won".
>>
>> OK, let's simplify: just reinject at next possible chance. No need to
>> monitor or tell anything.
>
> There are guests that won't like this (I know of one in-house, but
> others may even have more examples), specifically if you end up firing
> multiple IRQs in a row due to a longer backlog. For that reason, the RTC
> spreads the reinjection according to the current rate.

Then reinject with a constant delay, or next CPU exit. Such buggy
guests could also be assisted with special handling (like win2k
install hack), for example guest instructions could be counted
(approximately, for example using TB size or TSC) and only inject
after at least N instructions have passed.

> And even if the rate did not matter, the APIC woult still have to now
> about the fact that an IRQ is really periodic and does not only appear
> as such for a certain interval. This really does not sound like
> simplifying things or even make them cleaner.

It would, the voodoo would be contained only in APIC, RTC would be
just like any other device. With the bidirectional irqs, this voodoo
would probably eventually spread to many other devices. The logical
conclusion of that would be a system where all devices would be
careful not to disturb the guest at wrong moment because that would
trigger a bug.

At the other extreme, would it be possible to make the educated guests
aware of the virtualization also in clock aspect: virtio-clock?
Jan Kiszka - May 26, 2010, 10:35 p.m.
Blue Swirl wrote:
>>>>>> I think the real solution to coalescing is put the logic inside one
>>>>>> device, in this case APIC because it has the information about irq
>>>>>> delivery. APIC could monitor incoming RTC irqs for frequency
>>>>>> information and whether they get delivered or not. If not, an internal
>>>>>> timer is installed which injects the lost irqs.
>>>> That won't fly as the IRQs will already arrive at the APIC with a
>>>> sufficiently high jitter. At the bare minimum, you need to tell the
>>>> interrupt controller about the fact that a particular IRQ should be
>>>> delivered at a specific regular rate. For this, you also need a generic
>>>> interface - nothing really "won".
>>> OK, let's simplify: just reinject at next possible chance. No need to
>>> monitor or tell anything.
>> There are guests that won't like this (I know of one in-house, but
>> others may even have more examples), specifically if you end up firing
>> multiple IRQs in a row due to a longer backlog. For that reason, the RTC
>> spreads the reinjection according to the current rate.
> 
> Then reinject with a constant delay, or next CPU exit. Such buggy
> guests could also be assisted with special handling (like win2k
> install hack), for example guest instructions could be counted
> (approximately, for example using TB size or TSC) and only inject
> after at least N instructions have passed.

Let's assume that would be an alternative...

> 
>> And even if the rate did not matter, the APIC woult still have to now
>> about the fact that an IRQ is really periodic and does not only appear
>> as such for a certain interval. This really does not sound like
>> simplifying things or even make them cleaner.
> 
> It would, the voodoo would be contained only in APIC, RTC would be
> just like any other device.

...we would still need that communication channel from the RTC to the
APIC to tell the latter about entering/leaving periodic IRQ generation.
And as we don't want the RTC requiring any clue about who's finally
delivering the IRQ, we still need some generic interface.

This cannot be contained in the APIC - even leaving the PIC as another
IRQ delivery service aside for now, and maybe other virtualized
architectures.

> With the bidirectional irqs, this voodoo
> would probably eventually spread to many other devices.

Yes, the HPET was reported to need this feature as well.

> The logical
> conclusion of that would be a system where all devices would be
> careful not to disturb the guest at wrong moment because that would
> trigger a bug.
> 
> At the other extreme, would it be possible to make the educated guests
> aware of the virtualization also in clock aspect: virtio-clock?

Paravirtual clocks are around since ages (e.g. kvm-clock), just
unfortunately not long enough to find them supported by legacy guests.

Jan
Paul Brook - May 26, 2010, 11:26 p.m.
> At the other extreme, would it be possible to make the educated guests
> aware of the virtualization also in clock aspect: virtio-clock?

The guest doesn't even need to be aware of virtualization. It just needs to be 
able to accommodate the lack of guaranteed realtime behavior.

The fundamental problem here is that some guest operating systems assume that 
the hardware provides certain realtime guarantees with respect to execution of 
interrupt handlers.  In particular they assume that the CPU will always be 
able to complete execution of the timer IRQ handler before the periodic timer 
triggers again.  In most virtualized environments you have absolutely no 
guarantee of realtime response.

With Linux guests this was solved a long time ago by the introduction of 
tickless kernels.  These separate the timekeeping from wakeup events, so it 
doesn't matter if several wakeup triggers end up getting merged (either at the 
hardware level or via top/bottom half guest IRQ handlers).


It's worth mentioning that this problem also occurs on real hardware, 
typically due to lame hardware/drivers which end up masking interrupts or 
otherwise stall the CPU for for long periods of time.


The PIT hack attempts to workaround broken guests by adding artificial latency 
to the timer event, ensuring that the guest "sees" them all.  Unfortunately 
guests vary on when it is safe for them to see the next timer event, and 
trying to observe this behavior involves potentially harmful heuristics and 
collusion between unrelated devices (e.g. interrupt controller and timer).

In some cases we don't even do that, and just reschedule the event some 
arbitrarily small amount of time later. This assumes the guest to do useful 
work in that time. In a single threaded environment this is probably true - 
qemu got enough CPU to inject the first interrupt, so will probably manage to 
execute some guest code before the end of its timeslice. In an environment 
where interrupt processing/delivery and execution of the guest code happen in 
different threads this becomes increasingly likely to fail.

Paul
Gleb Natapov - May 27, 2010, 5:42 a.m.
On Wed, May 26, 2010 at 08:14:38PM +0000, Blue Swirl wrote:
> On Wed, May 26, 2010 at 8:08 AM, Gleb Natapov <gleb@redhat.com> wrote:
> > On Tue, May 25, 2010 at 11:44:50PM +0200, Jan Kiszka wrote:
> >> >
> >> >> I think the real solution to coalescing is put the logic inside one
> >> >> device, in this case APIC because it has the information about irq
> >> >> delivery. APIC could monitor incoming RTC irqs for frequency
> >> >> information and whether they get delivered or not. If not, an internal
> >> >> timer is installed which injects the lost irqs.
> >>
> >> That won't fly as the IRQs will already arrive at the APIC with a
> >> sufficiently high jitter. At the bare minimum, you need to tell the
> >> interrupt controller about the fact that a particular IRQ should be
> >> delivered at a specific regular rate. For this, you also need a generic
> >> interface - nothing really "won".
> >>
> > Not only that. Suppose RTC runs with 64Hz frequency and N interrupts
> > were coalesced during some period. Now the guest reprograms RTC to
> > 1024Hz. N should be scaled accordingly otherwise reinjection will not
> > fix the drift. Do you propose to put this logic into APIC to?
> 
> Interesting case, I don't think this is handled by the current code
> either.
Of course it is:
        if(period != s->period)
            s->irq_coalesced = (s->irq_coalesced * s->period) / period;

Qemu is not a toy anymore and dials with real loads in production now.

>          Could this happen if the target machine does not have HPET and
> the guest runs a tickless kernel?
> 
Windows doesn't have tickles kernel and Linux does not use RTC as time
source.

> I think the guest would be buggy to reprogram the timer without
> checking that the interrupt from the previous timer frequency won't
> interfere, for example by stopping the clock, or doing the
> reprogramming only at timer interrupt handler. Otherwise the period
> may be unpredictable for one period, which means that the time base
> has been lost.
Guest knows nothing about interrupt that was not yet delivered. And
Windows doesn't stop clock while changing frequency.

> 
> But let's consider how this could be handled with the current code (or
> with the magical interrupts). When doing the scaling, the guest, while
> reprogramming, is unaware of the old queued interrupts with the
> previous rate. Why would scaling these be more correct? I'd think the
> old ones should be just reinjected ASAP without any scaling.
If old once will be reinjected as is time calculation in the guest
will be incorrect. Windows is really stupid when it comes to time
keeping. It just count interrupts and multiplies them by clock frequency
to calculate the time. So if qemu accumulated 64 coalesced interrupts
while clock frequency was 64Hz it means that the guest 1 second behind
the time. If the guest change frequency to 1024hz reinfecting 64 interrupts
will not fix guest time, 1024 interrupts should be injected instead.

--
			Gleb.
Gleb Natapov - May 27, 2010, 5:58 a.m.
On Wed, May 26, 2010 at 10:09:52PM +0200, Jan Kiszka wrote:
> Blue Swirl wrote:
> > On Tue, May 25, 2010 at 9:44 PM, Jan Kiszka <jan.kiszka@web.de> wrote:
> >> Anthony Liguori wrote:
> >>> On 05/25/2010 02:09 PM, Blue Swirl wrote:
> >>>> On Mon, May 24, 2010 at 8:13 PM, Jan Kiszka<jan.kiszka@web.de>  wrote:
> >>>>
> >>>>> From: Jan Kiszka<jan.kiszka@siemens.com>
> >>>>>
> >>>>> This allows to communicate potential IRQ coalescing during delivery from
> >>>>> the sink back to the source. Targets that support IRQ coalescing
> >>>>> workarounds need to register handlers that return the appropriate
> >>>>> QEMU_IRQ_* code, and they have to propergate the code across all IRQ
> >>>>> redirections. If the IRQ source receives a QEMU_IRQ_COALESCED, it can
> >>>>> apply its workaround. If multiple sinks exist, the source may only
> >>>>> consider an IRQ coalesced if all other sinks either report
> >>>>> QEMU_IRQ_COALESCED as well or QEMU_IRQ_MASKED.
> >>>>>
> >>>> No real devices are interested whether any of their output lines are
> >>>> even connected. This would introduce a new signal type, bidirectional
> >>>> multi-level, which is not correct.
> >>>>
> >>> I don't think it's really an issue of correct, but I wouldn't disagree
> >>> to a suggestion that we ought to introduce a new signal type for this
> >>> type of bidirectional feedback.  Maybe it's qemu_coalesced_irq and has a
> >>> similar interface as qemu_irq.
> >> A separate type would complicate the delivery of the feedback value
> >> across GPIO pins (as Paul requested for the RTC->HPET routing).
> >>
> >>>> I think the real solution to coalescing is put the logic inside one
> >>>> device, in this case APIC because it has the information about irq
> >>>> delivery. APIC could monitor incoming RTC irqs for frequency
> >>>> information and whether they get delivered or not. If not, an internal
> >>>> timer is installed which injects the lost irqs.
> >> That won't fly as the IRQs will already arrive at the APIC with a
> >> sufficiently high jitter. At the bare minimum, you need to tell the
> >> interrupt controller about the fact that a particular IRQ should be
> >> delivered at a specific regular rate. For this, you also need a generic
> >> interface - nothing really "won".
> > 
> > OK, let's simplify: just reinject at next possible chance. No need to
> > monitor or tell anything.
> 
> There are guests that won't like this (I know of one in-house, but
> others may even have more examples), specifically if you end up firing
> multiple IRQs in a row due to a longer backlog. For that reason, the RTC
> spreads the reinjection according to the current rate.
> 
No need to look far for such guests. See commit dd17765b5f77ca.

> And even if the rate did not matter, the APIC woult still have to now
> about the fact that an IRQ is really periodic and does not only appear
> as such for a certain interval. This really does not sound like
> simplifying things or even make them cleaner.
> 
> Jan
> 



--
			Gleb.
Gleb Natapov - May 27, 2010, 6:13 a.m.
On Wed, May 26, 2010 at 08:35:00PM +0000, Blue Swirl wrote:
> On Wed, May 26, 2010 at 8:09 PM, Jan Kiszka <jan.kiszka@web.de> wrote:
> > Blue Swirl wrote:
> >> On Tue, May 25, 2010 at 9:44 PM, Jan Kiszka <jan.kiszka@web.de> wrote:
> >>> Anthony Liguori wrote:
> >>>> On 05/25/2010 02:09 PM, Blue Swirl wrote:
> >>>>> On Mon, May 24, 2010 at 8:13 PM, Jan Kiszka<jan.kiszka@web.de>  wrote:
> >>>>>
> >>>>>> From: Jan Kiszka<jan.kiszka@siemens.com>
> >>>>>>
> >>>>>> This allows to communicate potential IRQ coalescing during delivery from
> >>>>>> the sink back to the source. Targets that support IRQ coalescing
> >>>>>> workarounds need to register handlers that return the appropriate
> >>>>>> QEMU_IRQ_* code, and they have to propergate the code across all IRQ
> >>>>>> redirections. If the IRQ source receives a QEMU_IRQ_COALESCED, it can
> >>>>>> apply its workaround. If multiple sinks exist, the source may only
> >>>>>> consider an IRQ coalesced if all other sinks either report
> >>>>>> QEMU_IRQ_COALESCED as well or QEMU_IRQ_MASKED.
> >>>>>>
> >>>>> No real devices are interested whether any of their output lines are
> >>>>> even connected. This would introduce a new signal type, bidirectional
> >>>>> multi-level, which is not correct.
> >>>>>
> >>>> I don't think it's really an issue of correct, but I wouldn't disagree
> >>>> to a suggestion that we ought to introduce a new signal type for this
> >>>> type of bidirectional feedback.  Maybe it's qemu_coalesced_irq and has a
> >>>> similar interface as qemu_irq.
> >>> A separate type would complicate the delivery of the feedback value
> >>> across GPIO pins (as Paul requested for the RTC->HPET routing).
> >>>
> >>>>> I think the real solution to coalescing is put the logic inside one
> >>>>> device, in this case APIC because it has the information about irq
> >>>>> delivery. APIC could monitor incoming RTC irqs for frequency
> >>>>> information and whether they get delivered or not. If not, an internal
> >>>>> timer is installed which injects the lost irqs.
> >>> That won't fly as the IRQs will already arrive at the APIC with a
> >>> sufficiently high jitter. At the bare minimum, you need to tell the
> >>> interrupt controller about the fact that a particular IRQ should be
> >>> delivered at a specific regular rate. For this, you also need a generic
> >>> interface - nothing really "won".
> >>
> >> OK, let's simplify: just reinject at next possible chance. No need to
> >> monitor or tell anything.
> >
> > There are guests that won't like this (I know of one in-house, but
> > others may even have more examples), specifically if you end up firing
> > multiple IRQs in a row due to a longer backlog. For that reason, the RTC
> > spreads the reinjection according to the current rate.
> 
> Then reinject with a constant delay, or next CPU exit. Such buggy
If guest's time frequency is the same as host time frequency you can't
reinject with constant delay. That is why current code mixes two
approaches: reinject M interrupts in a raw then delay.

> guests could also be assisted with special handling (like win2k
> install hack), for example guest instructions could be counted
> (approximately, for example using TB size or TSC) and only inject
> after at least N instructions have passed.
Guest instructions cannot be easily counted in KVM (it can be done more
or less reliably using perf counters, may be).

> 
> > And even if the rate did not matter, the APIC woult still have to now
> > about the fact that an IRQ is really periodic and does not only appear
> > as such for a certain interval. This really does not sound like
> > simplifying things or even make them cleaner.
> 
> It would, the voodoo would be contained only in APIC, RTC would be
> just like any other device. With the bidirectional irqs, this voodoo
> would probably eventually spread to many other devices. The logical
> conclusion of that would be a system where all devices would be
> careful not to disturb the guest at wrong moment because that would
> trigger a bug.
> 
This voodoo will be so complex and unreliable that it will make RTC hack
pale in comparison (and I still don't see how you are going to make it
actually work). The fact is that timer device is not "just like any
other device" in virtual world. Any other device is easy: you just
implement spec as close as possible and everything works. For time
source device this is not enough. You can implement RTC+HPET to the
letter and your guest will drift like crazy.

> At the other extreme, would it be possible to make the educated guests
> aware of the virtualization also in clock aspect: virtio-clock?
No.

--
			Gleb.
Blue Swirl - May 27, 2010, 5:56 p.m.
On Wed, May 26, 2010 at 11:26 PM, Paul Brook <paul@codesourcery.com> wrote:
>> At the other extreme, would it be possible to make the educated guests
>> aware of the virtualization also in clock aspect: virtio-clock?
>
> The guest doesn't even need to be aware of virtualization. It just needs to be
> able to accommodate the lack of guaranteed realtime behavior.
>
> The fundamental problem here is that some guest operating systems assume that
> the hardware provides certain realtime guarantees with respect to execution of
> interrupt handlers.  In particular they assume that the CPU will always be
> able to complete execution of the timer IRQ handler before the periodic timer
> triggers again.  In most virtualized environments you have absolutely no
> guarantee of realtime response.
>
> With Linux guests this was solved a long time ago by the introduction of
> tickless kernels.  These separate the timekeeping from wakeup events, so it
> doesn't matter if several wakeup triggers end up getting merged (either at the
> hardware level or via top/bottom half guest IRQ handlers).
>
>
> It's worth mentioning that this problem also occurs on real hardware,
> typically due to lame hardware/drivers which end up masking interrupts or
> otherwise stall the CPU for for long periods of time.
>
>
> The PIT hack attempts to workaround broken guests by adding artificial latency
> to the timer event, ensuring that the guest "sees" them all.  Unfortunately
> guests vary on when it is safe for them to see the next timer event, and
> trying to observe this behavior involves potentially harmful heuristics and
> collusion between unrelated devices (e.g. interrupt controller and timer).
>
> In some cases we don't even do that, and just reschedule the event some
> arbitrarily small amount of time later. This assumes the guest to do useful
> work in that time. In a single threaded environment this is probably true -
> qemu got enough CPU to inject the first interrupt, so will probably manage to
> execute some guest code before the end of its timeslice. In an environment
> where interrupt processing/delivery and execution of the guest code happen in
> different threads this becomes increasingly likely to fail.

So any voodoo around timer events is doomed to fail in some cases.
What's the amount of hacks what we want then? Is there any generic
solution, like slowing down the guest system to the point where we can
guarantee the interrupt rate vs. CPU execution speed?
Jan Kiszka - May 27, 2010, 6:31 p.m.
Blue Swirl wrote:
> On Wed, May 26, 2010 at 11:26 PM, Paul Brook <paul@codesourcery.com> wrote:
>>> At the other extreme, would it be possible to make the educated guests
>>> aware of the virtualization also in clock aspect: virtio-clock?
>> The guest doesn't even need to be aware of virtualization. It just needs to be
>> able to accommodate the lack of guaranteed realtime behavior.
>>
>> The fundamental problem here is that some guest operating systems assume that
>> the hardware provides certain realtime guarantees with respect to execution of
>> interrupt handlers.  In particular they assume that the CPU will always be
>> able to complete execution of the timer IRQ handler before the periodic timer
>> triggers again.  In most virtualized environments you have absolutely no
>> guarantee of realtime response.
>>
>> With Linux guests this was solved a long time ago by the introduction of
>> tickless kernels.  These separate the timekeeping from wakeup events, so it
>> doesn't matter if several wakeup triggers end up getting merged (either at the
>> hardware level or via top/bottom half guest IRQ handlers).
>>
>>
>> It's worth mentioning that this problem also occurs on real hardware,
>> typically due to lame hardware/drivers which end up masking interrupts or
>> otherwise stall the CPU for for long periods of time.
>>
>>
>> The PIT hack attempts to workaround broken guests by adding artificial latency
>> to the timer event, ensuring that the guest "sees" them all.  Unfortunately
>> guests vary on when it is safe for them to see the next timer event, and
>> trying to observe this behavior involves potentially harmful heuristics and
>> collusion between unrelated devices (e.g. interrupt controller and timer).
>>
>> In some cases we don't even do that, and just reschedule the event some
>> arbitrarily small amount of time later. This assumes the guest to do useful
>> work in that time. In a single threaded environment this is probably true -
>> qemu got enough CPU to inject the first interrupt, so will probably manage to
>> execute some guest code before the end of its timeslice. In an environment
>> where interrupt processing/delivery and execution of the guest code happen in
>> different threads this becomes increasingly likely to fail.
> 
> So any voodoo around timer events is doomed to fail in some cases.
> What's the amount of hacks what we want then? Is there any generic

The aim of this patch is to reduce the amount of existing and upcoming
hacks. It may still require some refinements, but I think we haven't
found any smarter approach yet that fits existing use cases.

> solution, like slowing down the guest system to the point where we can
> guarantee the interrupt rate vs. CPU execution speed?

That's generally a non-option in virtualized production environments.
Specifically if the guest system lost interrupts due to host
overcommitment, you do not want it slow down even further.

Jan
Blue Swirl - May 27, 2010, 6:37 p.m.
2010/5/27 Gleb Natapov <gleb@redhat.com>:
> On Wed, May 26, 2010 at 08:35:00PM +0000, Blue Swirl wrote:
>> On Wed, May 26, 2010 at 8:09 PM, Jan Kiszka <jan.kiszka@web.de> wrote:
>> > Blue Swirl wrote:
>> >> On Tue, May 25, 2010 at 9:44 PM, Jan Kiszka <jan.kiszka@web.de> wrote:
>> >>> Anthony Liguori wrote:
>> >>>> On 05/25/2010 02:09 PM, Blue Swirl wrote:
>> >>>>> On Mon, May 24, 2010 at 8:13 PM, Jan Kiszka<jan.kiszka@web.de>  wrote:
>> >>>>>
>> >>>>>> From: Jan Kiszka<jan.kiszka@siemens.com>
>> >>>>>>
>> >>>>>> This allows to communicate potential IRQ coalescing during delivery from
>> >>>>>> the sink back to the source. Targets that support IRQ coalescing
>> >>>>>> workarounds need to register handlers that return the appropriate
>> >>>>>> QEMU_IRQ_* code, and they have to propergate the code across all IRQ
>> >>>>>> redirections. If the IRQ source receives a QEMU_IRQ_COALESCED, it can
>> >>>>>> apply its workaround. If multiple sinks exist, the source may only
>> >>>>>> consider an IRQ coalesced if all other sinks either report
>> >>>>>> QEMU_IRQ_COALESCED as well or QEMU_IRQ_MASKED.
>> >>>>>>
>> >>>>> No real devices are interested whether any of their output lines are
>> >>>>> even connected. This would introduce a new signal type, bidirectional
>> >>>>> multi-level, which is not correct.
>> >>>>>
>> >>>> I don't think it's really an issue of correct, but I wouldn't disagree
>> >>>> to a suggestion that we ought to introduce a new signal type for this
>> >>>> type of bidirectional feedback.  Maybe it's qemu_coalesced_irq and has a
>> >>>> similar interface as qemu_irq.
>> >>> A separate type would complicate the delivery of the feedback value
>> >>> across GPIO pins (as Paul requested for the RTC->HPET routing).
>> >>>
>> >>>>> I think the real solution to coalescing is put the logic inside one
>> >>>>> device, in this case APIC because it has the information about irq
>> >>>>> delivery. APIC could monitor incoming RTC irqs for frequency
>> >>>>> information and whether they get delivered or not. If not, an internal
>> >>>>> timer is installed which injects the lost irqs.
>> >>> That won't fly as the IRQs will already arrive at the APIC with a
>> >>> sufficiently high jitter. At the bare minimum, you need to tell the
>> >>> interrupt controller about the fact that a particular IRQ should be
>> >>> delivered at a specific regular rate. For this, you also need a generic
>> >>> interface - nothing really "won".
>> >>
>> >> OK, let's simplify: just reinject at next possible chance. No need to
>> >> monitor or tell anything.
>> >
>> > There are guests that won't like this (I know of one in-house, but
>> > others may even have more examples), specifically if you end up firing
>> > multiple IRQs in a row due to a longer backlog. For that reason, the RTC
>> > spreads the reinjection according to the current rate.
>>
>> Then reinject with a constant delay, or next CPU exit. Such buggy
> If guest's time frequency is the same as host time frequency you can't
> reinject with constant delay. That is why current code mixes two
> approaches: reinject M interrupts in a raw then delay.

This approach can be also used by APIC-only version.

>> guests could also be assisted with special handling (like win2k
>> install hack), for example guest instructions could be counted
>> (approximately, for example using TB size or TSC) and only inject
>> after at least N instructions have passed.
> Guest instructions cannot be easily counted in KVM (it can be done more
> or less reliably using perf counters, may be).

Aren't there any debug registers or perf counters, which can generate
an interrupt after some number of instructions have been executed?

>>
>> > And even if the rate did not matter, the APIC woult still have to now
>> > about the fact that an IRQ is really periodic and does not only appear
>> > as such for a certain interval. This really does not sound like
>> > simplifying things or even make them cleaner.
>>
>> It would, the voodoo would be contained only in APIC, RTC would be
>> just like any other device. With the bidirectional irqs, this voodoo
>> would probably eventually spread to many other devices. The logical
>> conclusion of that would be a system where all devices would be
>> careful not to disturb the guest at wrong moment because that would
>> trigger a bug.
>>
> This voodoo will be so complex and unreliable that it will make RTC hack
> pale in comparison (and I still don't see how you are going to make it
> actually work).

Implement everything inside APIC: only coalescing and reinjection.
Maybe that version would not bend backwards as much as the current to
cater for buggy hosts.

> The fact is that timer device is not "just like any
> other device" in virtual world. Any other device is easy: you just
> implement spec as close as possible and everything works. For time
> source device this is not enough. You can implement RTC+HPET to the
> letter and your guest will drift like crazy.

It's doable: a cycle accurate emulator will not cause any drift,
without any voodoo. The interrupts would come after executing the same
instruction as the real HW. For emulating any sufficiently buggy
guests in any sufficiently desperate low resource conditions, this may
be the only option that will always work.

>
>> At the other extreme, would it be possible to make the educated guests
>> aware of the virtualization also in clock aspect: virtio-clock?
> No.
>
> --
>                        Gleb.
>
Blue Swirl - May 27, 2010, 6:53 p.m.
On Thu, May 27, 2010 at 6:31 PM, Jan Kiszka <jan.kiszka@web.de> wrote:
> Blue Swirl wrote:
>> On Wed, May 26, 2010 at 11:26 PM, Paul Brook <paul@codesourcery.com> wrote:
>>>> At the other extreme, would it be possible to make the educated guests
>>>> aware of the virtualization also in clock aspect: virtio-clock?
>>> The guest doesn't even need to be aware of virtualization. It just needs to be
>>> able to accommodate the lack of guaranteed realtime behavior.
>>>
>>> The fundamental problem here is that some guest operating systems assume that
>>> the hardware provides certain realtime guarantees with respect to execution of
>>> interrupt handlers.  In particular they assume that the CPU will always be
>>> able to complete execution of the timer IRQ handler before the periodic timer
>>> triggers again.  In most virtualized environments you have absolutely no
>>> guarantee of realtime response.
>>>
>>> With Linux guests this was solved a long time ago by the introduction of
>>> tickless kernels.  These separate the timekeeping from wakeup events, so it
>>> doesn't matter if several wakeup triggers end up getting merged (either at the
>>> hardware level or via top/bottom half guest IRQ handlers).
>>>
>>>
>>> It's worth mentioning that this problem also occurs on real hardware,
>>> typically due to lame hardware/drivers which end up masking interrupts or
>>> otherwise stall the CPU for for long periods of time.
>>>
>>>
>>> The PIT hack attempts to workaround broken guests by adding artificial latency
>>> to the timer event, ensuring that the guest "sees" them all.  Unfortunately
>>> guests vary on when it is safe for them to see the next timer event, and
>>> trying to observe this behavior involves potentially harmful heuristics and
>>> collusion between unrelated devices (e.g. interrupt controller and timer).
>>>
>>> In some cases we don't even do that, and just reschedule the event some
>>> arbitrarily small amount of time later. This assumes the guest to do useful
>>> work in that time. In a single threaded environment this is probably true -
>>> qemu got enough CPU to inject the first interrupt, so will probably manage to
>>> execute some guest code before the end of its timeslice. In an environment
>>> where interrupt processing/delivery and execution of the guest code happen in
>>> different threads this becomes increasingly likely to fail.
>>
>> So any voodoo around timer events is doomed to fail in some cases.
>> What's the amount of hacks what we want then? Is there any generic
>
> The aim of this patch is to reduce the amount of existing and upcoming
> hacks. It may still require some refinements, but I think we haven't
> found any smarter approach yet that fits existing use cases.

I don't feel we have tried other possibilities hard enough.

>> solution, like slowing down the guest system to the point where we can
>> guarantee the interrupt rate vs. CPU execution speed?
>
> That's generally a non-option in virtualized production environments.
> Specifically if the guest system lost interrupts due to host
> overcommitment, you do not want it slow down even further.

I meant that the guest time could be scaled down, for example 2s in
wall clock time would be presented to the guest as 1s. Then the amount
of CPU cycles between timer interrupts would increase and hopefully
the guest can keep up. If the guest sleeps, time base could be
accelerated to catch up with wall clock and then set back to 1:1 rate.

Slowing down could be triggered by measuring the guest load (for
example, by checking for presence of halt instructions), if it's close
to 1, time would be slowed down. If the guest starts to issue halt
instructions because it's more idle, we can increase speed.

If this approach worked, even APIC could be made ignorant about
coalescing voodoo so it should be a major cleanup.
Jan Kiszka - May 27, 2010, 7:08 p.m.
Blue Swirl wrote:
> On Thu, May 27, 2010 at 6:31 PM, Jan Kiszka <jan.kiszka@web.de> wrote:
>> Blue Swirl wrote:
>>> On Wed, May 26, 2010 at 11:26 PM, Paul Brook <paul@codesourcery.com> wrote:
>>>>> At the other extreme, would it be possible to make the educated guests
>>>>> aware of the virtualization also in clock aspect: virtio-clock?
>>>> The guest doesn't even need to be aware of virtualization. It just needs to be
>>>> able to accommodate the lack of guaranteed realtime behavior.
>>>>
>>>> The fundamental problem here is that some guest operating systems assume that
>>>> the hardware provides certain realtime guarantees with respect to execution of
>>>> interrupt handlers.  In particular they assume that the CPU will always be
>>>> able to complete execution of the timer IRQ handler before the periodic timer
>>>> triggers again.  In most virtualized environments you have absolutely no
>>>> guarantee of realtime response.
>>>>
>>>> With Linux guests this was solved a long time ago by the introduction of
>>>> tickless kernels.  These separate the timekeeping from wakeup events, so it
>>>> doesn't matter if several wakeup triggers end up getting merged (either at the
>>>> hardware level or via top/bottom half guest IRQ handlers).
>>>>
>>>>
>>>> It's worth mentioning that this problem also occurs on real hardware,
>>>> typically due to lame hardware/drivers which end up masking interrupts or
>>>> otherwise stall the CPU for for long periods of time.
>>>>
>>>>
>>>> The PIT hack attempts to workaround broken guests by adding artificial latency
>>>> to the timer event, ensuring that the guest "sees" them all.  Unfortunately
>>>> guests vary on when it is safe for them to see the next timer event, and
>>>> trying to observe this behavior involves potentially harmful heuristics and
>>>> collusion between unrelated devices (e.g. interrupt controller and timer).
>>>>
>>>> In some cases we don't even do that, and just reschedule the event some
>>>> arbitrarily small amount of time later. This assumes the guest to do useful
>>>> work in that time. In a single threaded environment this is probably true -
>>>> qemu got enough CPU to inject the first interrupt, so will probably manage to
>>>> execute some guest code before the end of its timeslice. In an environment
>>>> where interrupt processing/delivery and execution of the guest code happen in
>>>> different threads this becomes increasingly likely to fail.
>>> So any voodoo around timer events is doomed to fail in some cases.
>>> What's the amount of hacks what we want then? Is there any generic
>> The aim of this patch is to reduce the amount of existing and upcoming
>> hacks. It may still require some refinements, but I think we haven't
>> found any smarter approach yet that fits existing use cases.
> 
> I don't feel we have tried other possibilities hard enough.

Well, seeing prototypes wouldn't be bad, also to run real load againt
them. But at least I'm currently clueless what to implement.

> 
>>> solution, like slowing down the guest system to the point where we can
>>> guarantee the interrupt rate vs. CPU execution speed?
>> That's generally a non-option in virtualized production environments.
>> Specifically if the guest system lost interrupts due to host
>> overcommitment, you do not want it slow down even further.
> 
> I meant that the guest time could be scaled down, for example 2s in
> wall clock time would be presented to the guest as 1s.

But that is precisely what already happens when the guest loses timer
interrupts. There is no other time source for this kind of guests -
often except for some external events generated by systems which you
don't want to fall behind arbitrarily.

> Then the amount
> of CPU cycles between timer interrupts would increase and hopefully
> the guest can keep up. If the guest sleeps, time base could be
> accelerated to catch up with wall clock and then set back to 1:1 rate.

Can't follow you ATM, sorry. What should be slowed down then? And how
precisely?

Jan

> 
> Slowing down could be triggered by measuring the guest load (for
> example, by checking for presence of halt instructions), if it's close
> to 1, time would be slowed down. If the guest starts to issue halt
> instructions because it's more idle, we can increase speed.
> 
> If this approach worked, even APIC could be made ignorant about
> coalescing voodoo so it should be a major cleanup.
Blue Swirl - May 27, 2010, 7:19 p.m.
On Thu, May 27, 2010 at 7:08 PM, Jan Kiszka <jan.kiszka@web.de> wrote:
> Blue Swirl wrote:
>> On Thu, May 27, 2010 at 6:31 PM, Jan Kiszka <jan.kiszka@web.de> wrote:
>>> Blue Swirl wrote:
>>>> On Wed, May 26, 2010 at 11:26 PM, Paul Brook <paul@codesourcery.com> wrote:
>>>>>> At the other extreme, would it be possible to make the educated guests
>>>>>> aware of the virtualization also in clock aspect: virtio-clock?
>>>>> The guest doesn't even need to be aware of virtualization. It just needs to be
>>>>> able to accommodate the lack of guaranteed realtime behavior.
>>>>>
>>>>> The fundamental problem here is that some guest operating systems assume that
>>>>> the hardware provides certain realtime guarantees with respect to execution of
>>>>> interrupt handlers.  In particular they assume that the CPU will always be
>>>>> able to complete execution of the timer IRQ handler before the periodic timer
>>>>> triggers again.  In most virtualized environments you have absolutely no
>>>>> guarantee of realtime response.
>>>>>
>>>>> With Linux guests this was solved a long time ago by the introduction of
>>>>> tickless kernels.  These separate the timekeeping from wakeup events, so it
>>>>> doesn't matter if several wakeup triggers end up getting merged (either at the
>>>>> hardware level or via top/bottom half guest IRQ handlers).
>>>>>
>>>>>
>>>>> It's worth mentioning that this problem also occurs on real hardware,
>>>>> typically due to lame hardware/drivers which end up masking interrupts or
>>>>> otherwise stall the CPU for for long periods of time.
>>>>>
>>>>>
>>>>> The PIT hack attempts to workaround broken guests by adding artificial latency
>>>>> to the timer event, ensuring that the guest "sees" them all.  Unfortunately
>>>>> guests vary on when it is safe for them to see the next timer event, and
>>>>> trying to observe this behavior involves potentially harmful heuristics and
>>>>> collusion between unrelated devices (e.g. interrupt controller and timer).
>>>>>
>>>>> In some cases we don't even do that, and just reschedule the event some
>>>>> arbitrarily small amount of time later. This assumes the guest to do useful
>>>>> work in that time. In a single threaded environment this is probably true -
>>>>> qemu got enough CPU to inject the first interrupt, so will probably manage to
>>>>> execute some guest code before the end of its timeslice. In an environment
>>>>> where interrupt processing/delivery and execution of the guest code happen in
>>>>> different threads this becomes increasingly likely to fail.
>>>> So any voodoo around timer events is doomed to fail in some cases.
>>>> What's the amount of hacks what we want then? Is there any generic
>>> The aim of this patch is to reduce the amount of existing and upcoming
>>> hacks. It may still require some refinements, but I think we haven't
>>> found any smarter approach yet that fits existing use cases.
>>
>> I don't feel we have tried other possibilities hard enough.
>
> Well, seeing prototypes wouldn't be bad, also to run real load againt
> them. But at least I'm currently clueless what to implement.

Perhaps now is then not the time to rush to implement something, but
to brainstorm for a clean solution.

>>
>>>> solution, like slowing down the guest system to the point where we can
>>>> guarantee the interrupt rate vs. CPU execution speed?
>>> That's generally a non-option in virtualized production environments.
>>> Specifically if the guest system lost interrupts due to host
>>> overcommitment, you do not want it slow down even further.
>>
>> I meant that the guest time could be scaled down, for example 2s in
>> wall clock time would be presented to the guest as 1s.
>
> But that is precisely what already happens when the guest loses timer
> interrupts. There is no other time source for this kind of guests -
> often except for some external events generated by systems which you
> don't want to fall behind arbitrarily.
>
>> Then the amount
>> of CPU cycles between timer interrupts would increase and hopefully
>> the guest can keep up. If the guest sleeps, time base could be
>> accelerated to catch up with wall clock and then set back to 1:1 rate.
>
> Can't follow you ATM, sorry. What should be slowed down then? And how
> precisely?

I think vm_clock and everything that depends on vm_clock, also
rtc_clock should be tied to vm_clock in this mode, not host_clock.

>
> Jan
>
>>
>> Slowing down could be triggered by measuring the guest load (for
>> example, by checking for presence of halt instructions), if it's close
>> to 1, time would be slowed down. If the guest starts to issue halt
>> instructions because it's more idle, we can increase speed.
>>
>> If this approach worked, even APIC could be made ignorant about
>> coalescing voodoo so it should be a major cleanup.
>
>
>
Jan Kiszka - May 27, 2010, 10:19 p.m.
Blue Swirl wrote:
> On Thu, May 27, 2010 at 7:08 PM, Jan Kiszka <jan.kiszka@web.de> wrote:
>> Blue Swirl wrote:
>>> On Thu, May 27, 2010 at 6:31 PM, Jan Kiszka <jan.kiszka@web.de> wrote:
>>>> Blue Swirl wrote:
>>>>> On Wed, May 26, 2010 at 11:26 PM, Paul Brook <paul@codesourcery.com> wrote:
>>>>>>> At the other extreme, would it be possible to make the educated guests
>>>>>>> aware of the virtualization also in clock aspect: virtio-clock?
>>>>>> The guest doesn't even need to be aware of virtualization. It just needs to be
>>>>>> able to accommodate the lack of guaranteed realtime behavior.
>>>>>>
>>>>>> The fundamental problem here is that some guest operating systems assume that
>>>>>> the hardware provides certain realtime guarantees with respect to execution of
>>>>>> interrupt handlers.  In particular they assume that the CPU will always be
>>>>>> able to complete execution of the timer IRQ handler before the periodic timer
>>>>>> triggers again.  In most virtualized environments you have absolutely no
>>>>>> guarantee of realtime response.
>>>>>>
>>>>>> With Linux guests this was solved a long time ago by the introduction of
>>>>>> tickless kernels.  These separate the timekeeping from wakeup events, so it
>>>>>> doesn't matter if several wakeup triggers end up getting merged (either at the
>>>>>> hardware level or via top/bottom half guest IRQ handlers).
>>>>>>
>>>>>>
>>>>>> It's worth mentioning that this problem also occurs on real hardware,
>>>>>> typically due to lame hardware/drivers which end up masking interrupts or
>>>>>> otherwise stall the CPU for for long periods of time.
>>>>>>
>>>>>>
>>>>>> The PIT hack attempts to workaround broken guests by adding artificial latency
>>>>>> to the timer event, ensuring that the guest "sees" them all.  Unfortunately
>>>>>> guests vary on when it is safe for them to see the next timer event, and
>>>>>> trying to observe this behavior involves potentially harmful heuristics and
>>>>>> collusion between unrelated devices (e.g. interrupt controller and timer).
>>>>>>
>>>>>> In some cases we don't even do that, and just reschedule the event some
>>>>>> arbitrarily small amount of time later. This assumes the guest to do useful
>>>>>> work in that time. In a single threaded environment this is probably true -
>>>>>> qemu got enough CPU to inject the first interrupt, so will probably manage to
>>>>>> execute some guest code before the end of its timeslice. In an environment
>>>>>> where interrupt processing/delivery and execution of the guest code happen in
>>>>>> different threads this becomes increasingly likely to fail.
>>>>> So any voodoo around timer events is doomed to fail in some cases.
>>>>> What's the amount of hacks what we want then? Is there any generic
>>>> The aim of this patch is to reduce the amount of existing and upcoming
>>>> hacks. It may still require some refinements, but I think we haven't
>>>> found any smarter approach yet that fits existing use cases.
>>> I don't feel we have tried other possibilities hard enough.
>> Well, seeing prototypes wouldn't be bad, also to run real load againt
>> them. But at least I'm currently clueless what to implement.
> 
> Perhaps now is then not the time to rush to implement something, but
> to brainstorm for a clean solution.

And sometimes it can help to understand how ideas could even be improved
or why others doesn't work at all.

> 
>>>>> solution, like slowing down the guest system to the point where we can
>>>>> guarantee the interrupt rate vs. CPU execution speed?
>>>> That's generally a non-option in virtualized production environments.
>>>> Specifically if the guest system lost interrupts due to host
>>>> overcommitment, you do not want it slow down even further.
>>> I meant that the guest time could be scaled down, for example 2s in
>>> wall clock time would be presented to the guest as 1s.
>> But that is precisely what already happens when the guest loses timer
>> interrupts. There is no other time source for this kind of guests -
>> often except for some external events generated by systems which you
>> don't want to fall behind arbitrarily.
>>
>>> Then the amount
>>> of CPU cycles between timer interrupts would increase and hopefully
>>> the guest can keep up. If the guest sleeps, time base could be
>>> accelerated to catch up with wall clock and then set back to 1:1 rate.
>> Can't follow you ATM, sorry. What should be slowed down then? And how
>> precisely?
> 
> I think vm_clock and everything that depends on vm_clock, also
> rtc_clock should be tied to vm_clock in this mode, not host_clock.

Let me check if I got this idea correctly: Instead of tuning just the
tick frequency of the affected timer device / sending its backlog in a
row, you rather want to tune the vm_clock correspondingly? Maybe a way
to abstract the required logic currently sitting only in the RTC for use
by other timer sources as well.

But just switching rtc_clock to vm_clock when the user wants host_clock
is obviously not an option. We would rather have to tune host_clock in
parallel.

Still, this does not answer:

- How do you want to detect lost timer ticks?

- What subsystem(s) keeps track of the backlog?

- And depending on the above: How to detect at all that a specific IRQ
  is a timer tick?

Jan
Paul Brook - May 27, 2010, 10:21 p.m.
> > In some cases we don't even do that, and just reschedule the event some
> > arbitrarily small amount of time later. This assumes the guest to do
> > useful work in that time. In a single threaded environment this is
> > probably true - qemu got enough CPU to inject the first interrupt, so
> > will probably manage to execute some guest code before the end of its
> > timeslice. In an environment where interrupt processing/delivery and
> > execution of the guest code happen in different threads this becomes
> > increasingly likely to fail.
> 
> So any voodoo around timer events is doomed to fail in some cases.

Depends on the level of voodoo.
My guess is that common guest operating systems require hacks which result in 
demonstrably incorrect behavior.

> What's the amount of hacks what we want then? Is there any generic
> solution, like slowing down the guest system to the point where we can
> guarantee the interrupt rate vs. CPU execution speed?

The "-icount N" option gives deterministic virtual realtime behavior, however 
teh guuest if completely decoupled from real-world time.
The "-icount auto" option gives semi-deterministic behavior while maintaining 
overall consistency with the real world.  This may introduce some small-scale 
time jitter, but will still satisfy all but the most demanding hard-real-time 
assumptions.

Neither of these options work with KVM. It may be possible to implement 
something using using performance counters.  I don't know how much additional 
overhead this would involve.

Paul
Paul Brook - May 27, 2010, 10:21 p.m.
> >> Then the amount
> >> of CPU cycles between timer interrupts would increase and hopefully
> >> the guest can keep up. If the guest sleeps, time base could be
> >> accelerated to catch up with wall clock and then set back to 1:1 rate.
> > 
> > Can't follow you ATM, sorry. What should be slowed down then? And how
> > precisely?
> 
> I think vm_clock and everything that depends on vm_clock, also
> rtc_clock should be tied to vm_clock in this mode, not host_clock.

The problem is more fundamental than that. There is no real correlation 
between vm_clock and the amount of code executed by the guest, especially not 
on timescales less than a second.

Paul
Gleb Natapov - May 28, 2010, 7:31 a.m.
On Thu, May 27, 2010 at 06:37:10PM +0000, Blue Swirl wrote:
> 2010/5/27 Gleb Natapov <gleb@redhat.com>:
> > On Wed, May 26, 2010 at 08:35:00PM +0000, Blue Swirl wrote:
> >> On Wed, May 26, 2010 at 8:09 PM, Jan Kiszka <jan.kiszka@web.de> wrote:
> >> > Blue Swirl wrote:
> >> >> On Tue, May 25, 2010 at 9:44 PM, Jan Kiszka <jan.kiszka@web.de> wrote:
> >> >>> Anthony Liguori wrote:
> >> >>>> On 05/25/2010 02:09 PM, Blue Swirl wrote:
> >> >>>>> On Mon, May 24, 2010 at 8:13 PM, Jan Kiszka<jan.kiszka@web.de>  wrote:
> >> >>>>>
> >> >>>>>> From: Jan Kiszka<jan.kiszka@siemens.com>
> >> >>>>>>
> >> >>>>>> This allows to communicate potential IRQ coalescing during delivery from
> >> >>>>>> the sink back to the source. Targets that support IRQ coalescing
> >> >>>>>> workarounds need to register handlers that return the appropriate
> >> >>>>>> QEMU_IRQ_* code, and they have to propergate the code across all IRQ
> >> >>>>>> redirections. If the IRQ source receives a QEMU_IRQ_COALESCED, it can
> >> >>>>>> apply its workaround. If multiple sinks exist, the source may only
> >> >>>>>> consider an IRQ coalesced if all other sinks either report
> >> >>>>>> QEMU_IRQ_COALESCED as well or QEMU_IRQ_MASKED.
> >> >>>>>>
> >> >>>>> No real devices are interested whether any of their output lines are
> >> >>>>> even connected. This would introduce a new signal type, bidirectional
> >> >>>>> multi-level, which is not correct.
> >> >>>>>
> >> >>>> I don't think it's really an issue of correct, but I wouldn't disagree
> >> >>>> to a suggestion that we ought to introduce a new signal type for this
> >> >>>> type of bidirectional feedback.  Maybe it's qemu_coalesced_irq and has a
> >> >>>> similar interface as qemu_irq.
> >> >>> A separate type would complicate the delivery of the feedback value
> >> >>> across GPIO pins (as Paul requested for the RTC->HPET routing).
> >> >>>
> >> >>>>> I think the real solution to coalescing is put the logic inside one
> >> >>>>> device, in this case APIC because it has the information about irq
> >> >>>>> delivery. APIC could monitor incoming RTC irqs for frequency
> >> >>>>> information and whether they get delivered or not. If not, an internal
> >> >>>>> timer is installed which injects the lost irqs.
> >> >>> That won't fly as the IRQs will already arrive at the APIC with a
> >> >>> sufficiently high jitter. At the bare minimum, you need to tell the
> >> >>> interrupt controller about the fact that a particular IRQ should be
> >> >>> delivered at a specific regular rate. For this, you also need a generic
> >> >>> interface - nothing really "won".
> >> >>
> >> >> OK, let's simplify: just reinject at next possible chance. No need to
> >> >> monitor or tell anything.
> >> >
> >> > There are guests that won't like this (I know of one in-house, but
> >> > others may even have more examples), specifically if you end up firing
> >> > multiple IRQs in a row due to a longer backlog. For that reason, the RTC
> >> > spreads the reinjection according to the current rate.
> >>
> >> Then reinject with a constant delay, or next CPU exit. Such buggy
> > If guest's time frequency is the same as host time frequency you can't
> > reinject with constant delay. That is why current code mixes two
> > approaches: reinject M interrupts in a raw then delay.
> 
> This approach can be also used by APIC-only version.
> 
I don't know what APIC-only version you are talking about. I haven't
seen the code and I don't understand hand waving, sorry.

> >> guests could also be assisted with special handling (like win2k
> >> install hack), for example guest instructions could be counted
> >> (approximately, for example using TB size or TSC) and only inject
> >> after at least N instructions have passed.
> > Guest instructions cannot be easily counted in KVM (it can be done more
> > or less reliably using perf counters, may be).
> 
> Aren't there any debug registers or perf counters, which can generate
> an interrupt after some number of instructions have been executed?
Don't think debug registers have something like that and they are
available for guest use anyway. Perf counters differs greatly from CPU
to CPU (even between two CPUs of the same manufacturer), and we want to
keep using them for profiling guests. And I don't see what problem it
will solve anyway that can be solved by simple delay between irq
reinjection.

> 
> >>
> >> > And even if the rate did not matter, the APIC woult still have to now
> >> > about the fact that an IRQ is really periodic and does not only appear
> >> > as such for a certain interval. This really does not sound like
> >> > simplifying things or even make them cleaner.
> >>
> >> It would, the voodoo would be contained only in APIC, RTC would be
> >> just like any other device. With the bidirectional irqs, this voodoo
> >> would probably eventually spread to many other devices. The logical
> >> conclusion of that would be a system where all devices would be
> >> careful not to disturb the guest at wrong moment because that would
> >> trigger a bug.
> >>
> > This voodoo will be so complex and unreliable that it will make RTC hack
> > pale in comparison (and I still don't see how you are going to make it
> > actually work).
> 
> Implement everything inside APIC: only coalescing and reinjection.
APIC has zero info needed to implement reinjection correctly as was
shown to you several time in this thread and you simply keep ignoring
it.

> Maybe that version would not bend backwards as much as the current to
> cater for buggy hosts.
> 
You mean "buggy guests"? What guests are not buggy in your opinion?
Linux tries hard to be smart and as a result the only way to have stable
clock with it is to go paravirt.

> > The fact is that timer device is not "just like any
> > other device" in virtual world. Any other device is easy: you just
> > implement spec as close as possible and everything works. For time
> > source device this is not enough. You can implement RTC+HPET to the
> > letter and your guest will drift like crazy.
> 
> It's doable: a cycle accurate emulator will not cause any drift,
> without any voodoo. The interrupts would come after executing the same
> instruction as the real HW. For emulating any sufficiently buggy
> guests in any sufficiently desperate low resource conditions, this may
> be the only option that will always work.
> 
Yes, but qemu and kvm are not cycle accurate emulators and don't strive
to be one. On the contrary KVM runs at native host CPU speed most of the
time, so any emulation done between two instruction is theoretically
noticeable for a guest. TSC is bypassed directly to a guest too, so
keeping all time source in perfect sync is also impossible.

> >
> >> At the other extreme, would it be possible to make the educated guests
> >> aware of the virtualization also in clock aspect: virtio-clock?
> > No.
> >
> > --
> >                        Gleb.
> >

--
			Gleb.
Blue Swirl - May 28, 2010, 7 p.m.
On Thu, May 27, 2010 at 10:19 PM, Jan Kiszka <jan.kiszka@web.de> wrote:
> Blue Swirl wrote:
>> On Thu, May 27, 2010 at 7:08 PM, Jan Kiszka <jan.kiszka@web.de> wrote:
>>> Blue Swirl wrote:
>>>> On Thu, May 27, 2010 at 6:31 PM, Jan Kiszka <jan.kiszka@web.de> wrote:
>>>>> Blue Swirl wrote:
>>>>>> On Wed, May 26, 2010 at 11:26 PM, Paul Brook <paul@codesourcery.com> wrote:
>>>>>>>> At the other extreme, would it be possible to make the educated guests
>>>>>>>> aware of the virtualization also in clock aspect: virtio-clock?
>>>>>>> The guest doesn't even need to be aware of virtualization. It just needs to be
>>>>>>> able to accommodate the lack of guaranteed realtime behavior.
>>>>>>>
>>>>>>> The fundamental problem here is that some guest operating systems assume that
>>>>>>> the hardware provides certain realtime guarantees with respect to execution of
>>>>>>> interrupt handlers.  In particular they assume that the CPU will always be
>>>>>>> able to complete execution of the timer IRQ handler before the periodic timer
>>>>>>> triggers again.  In most virtualized environments you have absolutely no
>>>>>>> guarantee of realtime response.
>>>>>>>
>>>>>>> With Linux guests this was solved a long time ago by the introduction of
>>>>>>> tickless kernels.  These separate the timekeeping from wakeup events, so it
>>>>>>> doesn't matter if several wakeup triggers end up getting merged (either at the
>>>>>>> hardware level or via top/bottom half guest IRQ handlers).
>>>>>>>
>>>>>>>
>>>>>>> It's worth mentioning that this problem also occurs on real hardware,
>>>>>>> typically due to lame hardware/drivers which end up masking interrupts or
>>>>>>> otherwise stall the CPU for for long periods of time.
>>>>>>>
>>>>>>>
>>>>>>> The PIT hack attempts to workaround broken guests by adding artificial latency
>>>>>>> to the timer event, ensuring that the guest "sees" them all.  Unfortunately
>>>>>>> guests vary on when it is safe for them to see the next timer event, and
>>>>>>> trying to observe this behavior involves potentially harmful heuristics and
>>>>>>> collusion between unrelated devices (e.g. interrupt controller and timer).
>>>>>>>
>>>>>>> In some cases we don't even do that, and just reschedule the event some
>>>>>>> arbitrarily small amount of time later. This assumes the guest to do useful
>>>>>>> work in that time. In a single threaded environment this is probably true -
>>>>>>> qemu got enough CPU to inject the first interrupt, so will probably manage to
>>>>>>> execute some guest code before the end of its timeslice. In an environment
>>>>>>> where interrupt processing/delivery and execution of the guest code happen in
>>>>>>> different threads this becomes increasingly likely to fail.
>>>>>> So any voodoo around timer events is doomed to fail in some cases.
>>>>>> What's the amount of hacks what we want then? Is there any generic
>>>>> The aim of this patch is to reduce the amount of existing and upcoming
>>>>> hacks. It may still require some refinements, but I think we haven't
>>>>> found any smarter approach yet that fits existing use cases.
>>>> I don't feel we have tried other possibilities hard enough.
>>> Well, seeing prototypes wouldn't be bad, also to run real load againt
>>> them. But at least I'm currently clueless what to implement.
>>
>> Perhaps now is then not the time to rush to implement something, but
>> to brainstorm for a clean solution.
>
> And sometimes it can help to understand how ideas could even be improved
> or why others doesn't work at all.
>
>>
>>>>>> solution, like slowing down the guest system to the point where we can
>>>>>> guarantee the interrupt rate vs. CPU execution speed?
>>>>> That's generally a non-option in virtualized production environments.
>>>>> Specifically if the guest system lost interrupts due to host
>>>>> overcommitment, you do not want it slow down even further.
>>>> I meant that the guest time could be scaled down, for example 2s in
>>>> wall clock time would be presented to the guest as 1s.
>>> But that is precisely what already happens when the guest loses timer
>>> interrupts. There is no other time source for this kind of guests -
>>> often except for some external events generated by systems which you
>>> don't want to fall behind arbitrarily.
>>>
>>>> Then the amount
>>>> of CPU cycles between timer interrupts would increase and hopefully
>>>> the guest can keep up. If the guest sleeps, time base could be
>>>> accelerated to catch up with wall clock and then set back to 1:1 rate.
>>> Can't follow you ATM, sorry. What should be slowed down then? And how
>>> precisely?
>>
>> I think vm_clock and everything that depends on vm_clock, also
>> rtc_clock should be tied to vm_clock in this mode, not host_clock.
>
> Let me check if I got this idea correctly: Instead of tuning just the
> tick frequency of the affected timer device / sending its backlog in a
> row, you rather want to tune the vm_clock correspondingly? Maybe a way
> to abstract the required logic currently sitting only in the RTC for use
> by other timer sources as well.

Yes, that would be a good starting point.

> But just switching rtc_clock to vm_clock when the user wants host_clock
> is obviously not an option. We would rather have to tune host_clock in
> parallel.
>
> Still, this does not answer:
>
> - How do you want to detect lost timer ticks?

With APIC, just like like now. But I think detecting lost ticks
shouldn't be the only way. It's an indication that things have gone
wrong, in the past. It would be better to use also other measurements
to see that the guest is close to a stall _before_ it happens.

> - What subsystem(s) keeps track of the backlog?

Preferably high level (vl.c, exec.c, cpu-exec.c).

> - And depending on the above: How to detect at all that a specific IRQ
>  is a timer tick?

Actually any IRQs acks can be delayed, those could be taken as a sign
of guest overload.
Blue Swirl - May 28, 2010, 7:10 p.m.
On Thu, May 27, 2010 at 10:21 PM, Paul Brook <paul@codesourcery.com> wrote:
>
>> >> Then the amount
>> >> of CPU cycles between timer interrupts would increase and hopefully
>> >> the guest can keep up. If the guest sleeps, time base could be
>> >> accelerated to catch up with wall clock and then set back to 1:1 rate.
>> >
>> > Can't follow you ATM, sorry. What should be slowed down then? And how
>> > precisely?
>>
>> I think vm_clock and everything that depends on vm_clock, also
>> rtc_clock should be tied to vm_clock in this mode, not host_clock.
>
> The problem is more fundamental than that. There is no real correlation
> between vm_clock and the amount of code executed by the guest, especially not
> on timescales less than a second.

Can we measure (or at least estimate with higher accuracy than the
tick IRQ delivery jitter) the amount of code executed, somehow? For
example, add TSC sampling to all TB or KVM VCPU exit and load/store
paths?
Blue Swirl - May 28, 2010, 8:06 p.m.
2010/5/28 Gleb Natapov <gleb@redhat.com>:
> On Thu, May 27, 2010 at 06:37:10PM +0000, Blue Swirl wrote:
>> 2010/5/27 Gleb Natapov <gleb@redhat.com>:
>> > On Wed, May 26, 2010 at 08:35:00PM +0000, Blue Swirl wrote:
>> >> On Wed, May 26, 2010 at 8:09 PM, Jan Kiszka <jan.kiszka@web.de> wrote:
>> >> > Blue Swirl wrote:
>> >> >> On Tue, May 25, 2010 at 9:44 PM, Jan Kiszka <jan.kiszka@web.de> wrote:
>> >> >>> Anthony Liguori wrote:
>> >> >>>> On 05/25/2010 02:09 PM, Blue Swirl wrote:
>> >> >>>>> On Mon, May 24, 2010 at 8:13 PM, Jan Kiszka<jan.kiszka@web.de>  wrote:
>> >> >>>>>
>> >> >>>>>> From: Jan Kiszka<jan.kiszka@siemens.com>
>> >> >>>>>>
>> >> >>>>>> This allows to communicate potential IRQ coalescing during delivery from
>> >> >>>>>> the sink back to the source. Targets that support IRQ coalescing
>> >> >>>>>> workarounds need to register handlers that return the appropriate
>> >> >>>>>> QEMU_IRQ_* code, and they have to propergate the code across all IRQ
>> >> >>>>>> redirections. If the IRQ source receives a QEMU_IRQ_COALESCED, it can
>> >> >>>>>> apply its workaround. If multiple sinks exist, the source may only
>> >> >>>>>> consider an IRQ coalesced if all other sinks either report
>> >> >>>>>> QEMU_IRQ_COALESCED as well or QEMU_IRQ_MASKED.
>> >> >>>>>>
>> >> >>>>> No real devices are interested whether any of their output lines are
>> >> >>>>> even connected. This would introduce a new signal type, bidirectional
>> >> >>>>> multi-level, which is not correct.
>> >> >>>>>
>> >> >>>> I don't think it's really an issue of correct, but I wouldn't disagree
>> >> >>>> to a suggestion that we ought to introduce a new signal type for this
>> >> >>>> type of bidirectional feedback.  Maybe it's qemu_coalesced_irq and has a
>> >> >>>> similar interface as qemu_irq.
>> >> >>> A separate type would complicate the delivery of the feedback value
>> >> >>> across GPIO pins (as Paul requested for the RTC->HPET routing).
>> >> >>>
>> >> >>>>> I think the real solution to coalescing is put the logic inside one
>> >> >>>>> device, in this case APIC because it has the information about irq
>> >> >>>>> delivery. APIC could monitor incoming RTC irqs for frequency
>> >> >>>>> information and whether they get delivered or not. If not, an internal
>> >> >>>>> timer is installed which injects the lost irqs.
>> >> >>> That won't fly as the IRQs will already arrive at the APIC with a
>> >> >>> sufficiently high jitter. At the bare minimum, you need to tell the
>> >> >>> interrupt controller about the fact that a particular IRQ should be
>> >> >>> delivered at a specific regular rate. For this, you also need a generic
>> >> >>> interface - nothing really "won".
>> >> >>
>> >> >> OK, let's simplify: just reinject at next possible chance. No need to
>> >> >> monitor or tell anything.
>> >> >
>> >> > There are guests that won't like this (I know of one in-house, but
>> >> > others may even have more examples), specifically if you end up firing
>> >> > multiple IRQs in a row due to a longer backlog. For that reason, the RTC
>> >> > spreads the reinjection according to the current rate.
>> >>
>> >> Then reinject with a constant delay, or next CPU exit. Such buggy
>> > If guest's time frequency is the same as host time frequency you can't
>> > reinject with constant delay. That is why current code mixes two
>> > approaches: reinject M interrupts in a raw then delay.
>>
>> This approach can be also used by APIC-only version.
>>
> I don't know what APIC-only version you are talking about. I haven't
> seen the code and I don't understand hand waving, sorry.

There is no code, because we're still at architecture design stage.

>> >> guests could also be assisted with special handling (like win2k
>> >> install hack), for example guest instructions could be counted
>> >> (approximately, for example using TB size or TSC) and only inject
>> >> after at least N instructions have passed.
>> > Guest instructions cannot be easily counted in KVM (it can be done more
>> > or less reliably using perf counters, may be).
>>
>> Aren't there any debug registers or perf counters, which can generate
>> an interrupt after some number of instructions have been executed?
> Don't think debug registers have something like that and they are
> available for guest use anyway. Perf counters differs greatly from CPU
> to CPU (even between two CPUs of the same manufacturer), and we want to
> keep using them for profiling guests. And I don't see what problem it
> will solve anyway that can be solved by simple delay between irq
> reinjection.

This would allow counting the executed instructions and limit it. Thus
we could emulate a 500MHz CPU on a 2GHz CPU more accurately.

>>
>> >>
>> >> > And even if the rate did not matter, the APIC woult still have to now
>> >> > about the fact that an IRQ is really periodic and does not only appear
>> >> > as such for a certain interval. This really does not sound like
>> >> > simplifying things or even make them cleaner.
>> >>
>> >> It would, the voodoo would be contained only in APIC, RTC would be
>> >> just like any other device. With the bidirectional irqs, this voodoo
>> >> would probably eventually spread to many other devices. The logical
>> >> conclusion of that would be a system where all devices would be
>> >> careful not to disturb the guest at wrong moment because that would
>> >> trigger a bug.
>> >>
>> > This voodoo will be so complex and unreliable that it will make RTC hack
>> > pale in comparison (and I still don't see how you are going to make it
>> > actually work).
>>
>> Implement everything inside APIC: only coalescing and reinjection.
> APIC has zero info needed to implement reinjection correctly as was
> shown to you several time in this thread and you simply keep ignoring
> it.

On the contrary, APIC is actually the only source of the IRQ ack
information. RTC hack would not work without APIC (or the
bidirectional IRQ) passing this info to RTC.

What APIC doesn't have now is the timer frequency or period info. This
is known by RTC and also higher levels managing the clocks.

I keep ignoring the idea that the current model, where both RTC and
APIC must somehow work together to make coalescing work, is the only
possible just because it is committed and it happens to work in some
cases. It would be much better to concentrate this to one place, APIC
or preferably higher level where it may benefit other timers too.
Provided of course that the other models can be made to work.

>> Maybe that version would not bend backwards as much as the current to
>> cater for buggy hosts.
>>
> You mean "buggy guests"?

Yes, sorry.

> What guests are not buggy in your opinion?
> Linux tries hard to be smart and as a result the only way to have stable
> clock with it is to go paravirt.

I'm not an OS designer, but I think an OS should never crash, even if
a burst of IRQs is received. Reprogramming the timer should consider
the pending IRQ situation (0 or 1 with real HW). Those bugs are one
cause of the problem.

>> > The fact is that timer device is not "just like any
>> > other device" in virtual world. Any other device is easy: you just
>> > implement spec as close as possible and everything works. For time
>> > source device this is not enough. You can implement RTC+HPET to the
>> > letter and your guest will drift like crazy.
>>
>> It's doable: a cycle accurate emulator will not cause any drift,
>> without any voodoo. The interrupts would come after executing the same
>> instruction as the real HW. For emulating any sufficiently buggy
>> guests in any sufficiently desperate low resource conditions, this may
>> be the only option that will always work.
>>
> Yes, but qemu and kvm are not cycle accurate emulators and don't strive
> to be one. On the contrary KVM runs at native host CPU speed most of the
> time, so any emulation done between two instruction is theoretically
> noticeable for a guest. TSC is bypassed directly to a guest too, so
> keeping all time source in perfect sync is also impossible.

That is actually another cause of the problem. KVM gives the guest an
illusion that the VCPU speed is equal to host speed. When they don't
match, especially in critical code, there can be problems. It would be
better to tell the guest a lower speed, which also can be guaranteed.

Maybe we should also offline the device emulation to another host CPU
with threading. A load from a device will always be much slower than
on real HW though.
Gleb Natapov - May 28, 2010, 8:47 p.m.
On Fri, May 28, 2010 at 08:06:45PM +0000, Blue Swirl wrote:
> 2010/5/28 Gleb Natapov <gleb@redhat.com>:
> > On Thu, May 27, 2010 at 06:37:10PM +0000, Blue Swirl wrote:
> >> 2010/5/27 Gleb Natapov <gleb@redhat.com>:
> >> > On Wed, May 26, 2010 at 08:35:00PM +0000, Blue Swirl wrote:
> >> >> On Wed, May 26, 2010 at 8:09 PM, Jan Kiszka <jan.kiszka@web.de> wrote:
> >> >> > Blue Swirl wrote:
> >> >> >> On Tue, May 25, 2010 at 9:44 PM, Jan Kiszka <jan.kiszka@web.de> wrote:
> >> >> >>> Anthony Liguori wrote:
> >> >> >>>> On 05/25/2010 02:09 PM, Blue Swirl wrote:
> >> >> >>>>> On Mon, May 24, 2010 at 8:13 PM, Jan Kiszka<jan.kiszka@web.de>  wrote:
> >> >> >>>>>
> >> >> >>>>>> From: Jan Kiszka<jan.kiszka@siemens.com>
> >> >> >>>>>>
> >> >> >>>>>> This allows to communicate potential IRQ coalescing during delivery from
> >> >> >>>>>> the sink back to the source. Targets that support IRQ coalescing
> >> >> >>>>>> workarounds need to register handlers that return the appropriate
> >> >> >>>>>> QEMU_IRQ_* code, and they have to propergate the code across all IRQ
> >> >> >>>>>> redirections. If the IRQ source receives a QEMU_IRQ_COALESCED, it can
> >> >> >>>>>> apply its workaround. If multiple sinks exist, the source may only
> >> >> >>>>>> consider an IRQ coalesced if all other sinks either report
> >> >> >>>>>> QEMU_IRQ_COALESCED as well or QEMU_IRQ_MASKED.
> >> >> >>>>>>
> >> >> >>>>> No real devices are interested whether any of their output lines are
> >> >> >>>>> even connected. This would introduce a new signal type, bidirectional
> >> >> >>>>> multi-level, which is not correct.
> >> >> >>>>>
> >> >> >>>> I don't think it's really an issue of correct, but I wouldn't disagree
> >> >> >>>> to a suggestion that we ought to introduce a new signal type for this
> >> >> >>>> type of bidirectional feedback.  Maybe it's qemu_coalesced_irq and has a
> >> >> >>>> similar interface as qemu_irq.
> >> >> >>> A separate type would complicate the delivery of the feedback value
> >> >> >>> across GPIO pins (as Paul requested for the RTC->HPET routing).
> >> >> >>>
> >> >> >>>>> I think the real solution to coalescing is put the logic inside one
> >> >> >>>>> device, in this case APIC because it has the information about irq
> >> >> >>>>> delivery. APIC could monitor incoming RTC irqs for frequency
> >> >> >>>>> information and whether they get delivered or not. If not, an internal
> >> >> >>>>> timer is installed which injects the lost irqs.
> >> >> >>> That won't fly as the IRQs will already arrive at the APIC with a
> >> >> >>> sufficiently high jitter. At the bare minimum, you need to tell the
> >> >> >>> interrupt controller about the fact that a particular IRQ should be
> >> >> >>> delivered at a specific regular rate. For this, you also need a generic
> >> >> >>> interface - nothing really "won".
> >> >> >>
> >> >> >> OK, let's simplify: just reinject at next possible chance. No need to
> >> >> >> monitor or tell anything.
> >> >> >
> >> >> > There are guests that won't like this (I know of one in-house, but
> >> >> > others may even have more examples), specifically if you end up firing
> >> >> > multiple IRQs in a row due to a longer backlog. For that reason, the RTC
> >> >> > spreads the reinjection according to the current rate.
> >> >>
> >> >> Then reinject with a constant delay, or next CPU exit. Such buggy
> >> > If guest's time frequency is the same as host time frequency you can't
> >> > reinject with constant delay. That is why current code mixes two
> >> > approaches: reinject M interrupts in a raw then delay.
> >>
> >> This approach can be also used by APIC-only version.
> >>
> > I don't know what APIC-only version you are talking about. I haven't
> > seen the code and I don't understand hand waving, sorry.
> 
> There is no code, because we're still at architecture design stage.
> 
Try to write test code to understand the problem better.

> >> >> guests could also be assisted with special handling (like win2k
> >> >> install hack), for example guest instructions could be counted
> >> >> (approximately, for example using TB size or TSC) and only inject
> >> >> after at least N instructions have passed.
> >> > Guest instructions cannot be easily counted in KVM (it can be done more
> >> > or less reliably using perf counters, may be).
> >>
> >> Aren't there any debug registers or perf counters, which can generate
> >> an interrupt after some number of instructions have been executed?
> > Don't think debug registers have something like that and they are
> > available for guest use anyway. Perf counters differs greatly from CPU
> > to CPU (even between two CPUs of the same manufacturer), and we want to
> > keep using them for profiling guests. And I don't see what problem it
> > will solve anyway that can be solved by simple delay between irq
> > reinjection.
> 
> This would allow counting the executed instructions and limit it. Thus
> we could emulate a 500MHz CPU on a 2GHz CPU more accurately.
> 
Why would you want to limit number of instruction executed by guest if
CPU has nothing else to do anyway? The problem occurs not when we have
spare cycles so give to a guest, but in opposite case.

> >>
> >> >>
> >> >> > And even if the rate did not matter, the APIC woult still have to now
> >> >> > about the fact that an IRQ is really periodic and does not only appear
> >> >> > as such for a certain interval. This really does not sound like
> >> >> > simplifying things or even make them cleaner.
> >> >>
> >> >> It would, the voodoo would be contained only in APIC, RTC would be
> >> >> just like any other device. With the bidirectional irqs, this voodoo
> >> >> would probably eventually spread to many other devices. The logical
> >> >> conclusion of that would be a system where all devices would be
> >> >> careful not to disturb the guest at wrong moment because that would
> >> >> trigger a bug.
> >> >>
> >> > This voodoo will be so complex and unreliable that it will make RTC hack
> >> > pale in comparison (and I still don't see how you are going to make it
> >> > actually work).
> >>
> >> Implement everything inside APIC: only coalescing and reinjection.
> > APIC has zero info needed to implement reinjection correctly as was
> > shown to you several time in this thread and you simply keep ignoring
> > it.
> 
> On the contrary, APIC is actually the only source of the IRQ ack
> information. RTC hack would not work without APIC (or the
> bidirectional IRQ) passing this info to RTC.
> 
> What APIC doesn't have now is the timer frequency or period info. This
> is known by RTC and also higher levels managing the clocks.
> 
So APIC has one bit of information and RTC everything else. The current
approach (and proposed patch) brings this one bit of information to RTC,
you are arguing that RTC should be able to communicate all its info to
APIC. Sorry I don't see that your way has any advantage. Just more
complex interface and it is much easier to get it wrong for other time
sources.

> I keep ignoring the idea that the current model, where both RTC and
> APIC must somehow work together to make coalescing work, is the only
> possible just because it is committed and it happens to work in some
> cases. It would be much better to concentrate this to one place, APIC
> or preferably higher level where it may benefit other timers too.
> Provided of course that the other models can be made to work.
> 
So write the code and show us. You haven't show any evidence that RTC is
the wrong place. RTC knows when interrupt was acknowledge to RTC, it
know when clock frequency changes, it know when device reset happened.
APIC knows only that interrupt was coalesced. It doesn't even know that
it may be masked by a guest in IOAPIC (interrupts delivered while they
are masked not considered coalesced). Time source knows only when
frequency changes and may be when device reset happens if timer is
stopped by device on reset. So RTC is actually a sweet spot if you want
to minimize amount of info you need to pass between various layers.

> >> Maybe that version would not bend backwards as much as the current to
> >> cater for buggy hosts.
> >>
> > You mean "buggy guests"?
> 
> Yes, sorry.
> 
> > What guests are not buggy in your opinion?
> > Linux tries hard to be smart and as a result the only way to have stable
> > clock with it is to go paravirt.
> 
> I'm not an OS designer, but I think an OS should never crash, even if
> a burst of IRQs is received. Reprogramming the timer should consider
> the pending IRQ situation (0 or 1 with real HW). Those bugs are one
> cause of the problem.
OS should never crash in the absence of HW bugs? I doubt you can design
an OS that can run in a face of any HW failure. Anyway here we are
trying to solve guests time keeping problem not crashes. Do you think
you can design OS that can keep time accurately no matter how crazy all
HW clock behaves?

> 
> >> > The fact is that timer device is not "just like any
> >> > other device" in virtual world. Any other device is easy: you just
> >> > implement spec as close as possible and everything works. For time
> >> > source device this is not enough. You can implement RTC+HPET to the
> >> > letter and your guest will drift like crazy.
> >>
> >> It's doable: a cycle accurate emulator will not cause any drift,
> >> without any voodoo. The interrupts would come after executing the same
> >> instruction as the real HW. For emulating any sufficiently buggy
> >> guests in any sufficiently desperate low resource conditions, this may
> >> be the only option that will always work.
> >>
> > Yes, but qemu and kvm are not cycle accurate emulators and don't strive
> > to be one. On the contrary KVM runs at native host CPU speed most of the
> > time, so any emulation done between two instruction is theoretically
> > noticeable for a guest. TSC is bypassed directly to a guest too, so
> > keeping all time source in perfect sync is also impossible.
> 
> That is actually another cause of the problem. KVM gives the guest an
> illusion that the VCPU speed is equal to host speed. When they don't
> match, especially in critical code, there can be problems. It would be
> better to tell the guest a lower speed, which also can be guaranteed.
> 
Not possible. It's that simple. You should take it into account in your
architecture design stage. In case of KVM real physical CPU executes guest
instruction and it does this as fast as it can. The only way we can hide
that from a guest is by intercepting each access to TSC and at that
point we can use bochs instead.

> Maybe we should also offline the device emulation to another host CPU
> with threading. A load from a device will always be much slower than
> on real HW though.
Time drift problem start to happen on loaded servers, so you do not have
spare CPU to offload device emulation too.

--
			Gleb.
Jan Kiszka - May 29, 2010, 7:58 a.m.
Gleb Natapov wrote:
> On Fri, May 28, 2010 at 08:06:45PM +0000, Blue Swirl wrote:
>> 2010/5/28 Gleb Natapov <gleb@redhat.com>:
>>> On Thu, May 27, 2010 at 06:37:10PM +0000, Blue Swirl wrote:
>>>> 2010/5/27 Gleb Natapov <gleb@redhat.com>:
>>>>> On Wed, May 26, 2010 at 08:35:00PM +0000, Blue Swirl wrote:
>>>>>> On Wed, May 26, 2010 at 8:09 PM, Jan Kiszka <jan.kiszka@web.de> wrote:
>>>>>>> Blue Swirl wrote:
>>>>>>>> On Tue, May 25, 2010 at 9:44 PM, Jan Kiszka <jan.kiszka@web.de> wrote:
>>>>>>>>> Anthony Liguori wrote:
>>>>>>>>>> On 05/25/2010 02:09 PM, Blue Swirl wrote:
>>>>>>>>>>> On Mon, May 24, 2010 at 8:13 PM, Jan Kiszka<jan.kiszka@web.de>  wrote:
>>>>>>>>>>>
>>>>>>>>>>>> From: Jan Kiszka<jan.kiszka@siemens.com>
>>>>>>>>>>>>
>>>>>>>>>>>> This allows to communicate potential IRQ coalescing during delivery from
>>>>>>>>>>>> the sink back to the source. Targets that support IRQ coalescing
>>>>>>>>>>>> workarounds need to register handlers that return the appropriate
>>>>>>>>>>>> QEMU_IRQ_* code, and they have to propergate the code across all IRQ
>>>>>>>>>>>> redirections. If the IRQ source receives a QEMU_IRQ_COALESCED, it can
>>>>>>>>>>>> apply its workaround. If multiple sinks exist, the source may only
>>>>>>>>>>>> consider an IRQ coalesced if all other sinks either report
>>>>>>>>>>>> QEMU_IRQ_COALESCED as well or QEMU_IRQ_MASKED.
>>>>>>>>>>>>
>>>>>>>>>>> No real devices are interested whether any of their output lines are
>>>>>>>>>>> even connected. This would introduce a new signal type, bidirectional
>>>>>>>>>>> multi-level, which is not correct.
>>>>>>>>>>>
>>>>>>>>>> I don't think it's really an issue of correct, but I wouldn't disagree
>>>>>>>>>> to a suggestion that we ought to introduce a new signal type for this
>>>>>>>>>> type of bidirectional feedback.  Maybe it's qemu_coalesced_irq and has a
>>>>>>>>>> similar interface as qemu_irq.
>>>>>>>>> A separate type would complicate the delivery of the feedback value
>>>>>>>>> across GPIO pins (as Paul requested for the RTC->HPET routing).
>>>>>>>>>
>>>>>>>>>>> I think the real solution to coalescing is put the logic inside one
>>>>>>>>>>> device, in this case APIC because it has the information about irq
>>>>>>>>>>> delivery. APIC could monitor incoming RTC irqs for frequency
>>>>>>>>>>> information and whether they get delivered or not. If not, an internal
>>>>>>>>>>> timer is installed which injects the lost irqs.
>>>>>>>>> That won't fly as the IRQs will already arrive at the APIC with a
>>>>>>>>> sufficiently high jitter. At the bare minimum, you need to tell the
>>>>>>>>> interrupt controller about the fact that a particular IRQ should be
>>>>>>>>> delivered at a specific regular rate. For this, you also need a generic
>>>>>>>>> interface - nothing really "won".
>>>>>>>> OK, let's simplify: just reinject at next possible chance. No need to
>>>>>>>> monitor or tell anything.
>>>>>>> There are guests that won't like this (I know of one in-house, but
>>>>>>> others may even have more examples), specifically if you end up firing
>>>>>>> multiple IRQs in a row due to a longer backlog. For that reason, the RTC
>>>>>>> spreads the reinjection according to the current rate.
>>>>>> Then reinject with a constant delay, or next CPU exit. Such buggy
>>>>> If guest's time frequency is the same as host time frequency you can't
>>>>> reinject with constant delay. That is why current code mixes two
>>>>> approaches: reinject M interrupts in a raw then delay.
>>>> This approach can be also used by APIC-only version.
>>>>
>>> I don't know what APIC-only version you are talking about. I haven't
>>> seen the code and I don't understand hand waving, sorry.
>> There is no code, because we're still at architecture design stage.
>>
> Try to write test code to understand the problem better.
> 
>>>>>> guests could also be assisted with special handling (like win2k
>>>>>> install hack), for example guest instructions could be counted
>>>>>> (approximately, for example using TB size or TSC) and only inject
>>>>>> after at least N instructions have passed.
>>>>> Guest instructions cannot be easily counted in KVM (it can be done more
>>>>> or less reliably using perf counters, may be).
>>>> Aren't there any debug registers or perf counters, which can generate
>>>> an interrupt after some number of instructions have been executed?
>>> Don't think debug registers have something like that and they are
>>> available for guest use anyway. Perf counters differs greatly from CPU
>>> to CPU (even between two CPUs of the same manufacturer), and we want to
>>> keep using them for profiling guests. And I don't see what problem it
>>> will solve anyway that can be solved by simple delay between irq
>>> reinjection.
>> This would allow counting the executed instructions and limit it. Thus
>> we could emulate a 500MHz CPU on a 2GHz CPU more accurately.
>>
> Why would you want to limit number of instruction executed by guest if
> CPU has nothing else to do anyway? The problem occurs not when we have
> spare cycles so give to a guest, but in opposite case.
> 
>>>>>>> And even if the rate did not matter, the APIC woult still have to now
>>>>>>> about the fact that an IRQ is really periodic and does not only appear
>>>>>>> as such for a certain interval. This really does not sound like
>>>>>>> simplifying things or even make them cleaner.
>>>>>> It would, the voodoo would be contained only in APIC, RTC would be
>>>>>> just like any other device. With the bidirectional irqs, this voodoo
>>>>>> would probably eventually spread to many other devices. The logical
>>>>>> conclusion of that would be a system where all devices would be
>>>>>> careful not to disturb the guest at wrong moment because that would
>>>>>> trigger a bug.
>>>>>>
>>>>> This voodoo will be so complex and unreliable that it will make RTC hack
>>>>> pale in comparison (and I still don't see how you are going to make it
>>>>> actually work).
>>>> Implement everything inside APIC: only coalescing and reinjection.
>>> APIC has zero info needed to implement reinjection correctly as was
>>> shown to you several time in this thread and you simply keep ignoring
>>> it.
>> On the contrary, APIC is actually the only source of the IRQ ack
>> information. RTC hack would not work without APIC (or the
>> bidirectional IRQ) passing this info to RTC.
>>
>> What APIC doesn't have now is the timer frequency or period info. This
>> is known by RTC and also higher levels managing the clocks.
>>
> So APIC has one bit of information and RTC everything else. The current
> approach (and proposed patch) brings this one bit of information to RTC,
> you are arguing that RTC should be able to communicate all its info to
> APIC. Sorry I don't see that your way has any advantage. Just more
> complex interface and it is much easier to get it wrong for other time
> sources.
> 
>> I keep ignoring the idea that the current model, where both RTC and
>> APIC must somehow work together to make coalescing work, is the only
>> possible just because it is committed and it happens to work in some
>> cases. It would be much better to concentrate this to one place, APIC
>> or preferably higher level where it may benefit other timers too.
>> Provided of course that the other models can be made to work.
>>
> So write the code and show us. You haven't show any evidence that RTC is
> the wrong place. RTC knows when interrupt was acknowledge to RTC, it
> know when clock frequency changes, it know when device reset happened.
> APIC knows only that interrupt was coalesced. It doesn't even know that
> it may be masked by a guest in IOAPIC (interrupts delivered while they
> are masked not considered coalesced). Time source knows only when
> frequency changes and may be when device reset happens if timer is
> stopped by device on reset. So RTC is actually a sweet spot if you want
> to minimize amount of info you need to pass between various layers.

This is - according to my current understanding - the proposed
alternative architecture:

                                          .---------------.
                                          | de-coalescing |
                                          |     logic     |
                                          '---------------'
                                                ^   |
                                         period,|   |IRQ
                                       coalesced|   |(or tuned VM clock)
                                        (yes/no)|   v
.-------.              .--------.             .-------.
|  RTC  |-----IRQ----->| router |-----IRQ---->| APIC  |
'-------'              '--------'             '-------'
    |                    ^    |                   ^
    |                    |    |                   |
    '-------period-------'    '------period-------'


And here is what this patch implements (except for the not yet factored
out de-coalescing logic):

   .---------------.
   | de-coalescing |
   |     logic     |
   '---------------'
         ^   |
  period,|   |next IRQ date
coalesced|   |(or tuned VM clock)
 (yes/no)|   v
       .-------.              .--------.             .-------.
       |  RTC  |-----IRQ----->| router |-----IRQ---->| APIC  |
       '-------'              '--------'             '-------'
           ^                    |    ^                   |
           |                    |    |                   |
           '------coalesced-----'    '-----coalesced-----'


I still don't see how the alternative is supposed to simplify our life
or improve the efficiency of the de-coalescing workaround. It's rather
problematic like Gleb pointed out: The de-coalescing logic needs to be
informed about periodicity changes that can only be delivered along
IRQs. So what to do with the backlog when the timer is stopped?

Regarding an improved de-coalescing logic: As its final design is widely
independent of how we collect the information, it could perfectly be
done after we laid the elementary foundation.

Jan
Blue Swirl - May 29, 2010, 9:15 a.m.
2010/5/28 Gleb Natapov <gleb@redhat.com>:
> On Fri, May 28, 2010 at 08:06:45PM +0000, Blue Swirl wrote:
>> 2010/5/28 Gleb Natapov <gleb@redhat.com>:
>> > On Thu, May 27, 2010 at 06:37:10PM +0000, Blue Swirl wrote:
>> >> 2010/5/27 Gleb Natapov <gleb@redhat.com>:
>> >> > On Wed, May 26, 2010 at 08:35:00PM +0000, Blue Swirl wrote:
>> >> >> On Wed, May 26, 2010 at 8:09 PM, Jan Kiszka <jan.kiszka@web.de> wrote:
>> >> >> > Blue Swirl wrote:
>> >> >> >> On Tue, May 25, 2010 at 9:44 PM, Jan Kiszka <jan.kiszka@web.de> wrote:
>> >> >> >>> Anthony Liguori wrote:
>> >> >> >>>> On 05/25/2010 02:09 PM, Blue Swirl wrote:
>> >> >> >>>>> On Mon, May 24, 2010 at 8:13 PM, Jan Kiszka<jan.kiszka@web.de>  wrote:
>> >> >> >>>>>
>> >> >> >>>>>> From: Jan Kiszka<jan.kiszka@siemens.com>
>> >> >> >>>>>>
>> >> >> >>>>>> This allows to communicate potential IRQ coalescing during delivery from
>> >> >> >>>>>> the sink back to the source. Targets that support IRQ coalescing
>> >> >> >>>>>> workarounds need to register handlers that return the appropriate
>> >> >> >>>>>> QEMU_IRQ_* code, and they have to propergate the code across all IRQ
>> >> >> >>>>>> redirections. If the IRQ source receives a QEMU_IRQ_COALESCED, it can
>> >> >> >>>>>> apply its workaround. If multiple sinks exist, the source may only
>> >> >> >>>>>> consider an IRQ coalesced if all other sinks either report
>> >> >> >>>>>> QEMU_IRQ_COALESCED as well or QEMU_IRQ_MASKED.
>> >> >> >>>>>>
>> >> >> >>>>> No real devices are interested whether any of their output lines are
>> >> >> >>>>> even connected. This would introduce a new signal type, bidirectional
>> >> >> >>>>> multi-level, which is not correct.
>> >> >> >>>>>
>> >> >> >>>> I don't think it's really an issue of correct, but I wouldn't disagree
>> >> >> >>>> to a suggestion that we ought to introduce a new signal type for this
>> >> >> >>>> type of bidirectional feedback.  Maybe it's qemu_coalesced_irq and has a
>> >> >> >>>> similar interface as qemu_irq.
>> >> >> >>> A separate type would complicate the delivery of the feedback value
>> >> >> >>> across GPIO pins (as Paul requested for the RTC->HPET routing).
>> >> >> >>>
>> >> >> >>>>> I think the real solution to coalescing is put the logic inside one
>> >> >> >>>>> device, in this case APIC because it has the information about irq
>> >> >> >>>>> delivery. APIC could monitor incoming RTC irqs for frequency
>> >> >> >>>>> information and whether they get delivered or not. If not, an internal
>> >> >> >>>>> timer is installed which injects the lost irqs.
>> >> >> >>> That won't fly as the IRQs will already arrive at the APIC with a
>> >> >> >>> sufficiently high jitter. At the bare minimum, you need to tell the
>> >> >> >>> interrupt controller about the fact that a particular IRQ should be
>> >> >> >>> delivered at a specific regular rate. For this, you also need a generic
>> >> >> >>> interface - nothing really "won".
>> >> >> >>
>> >> >> >> OK, let's simplify: just reinject at next possible chance. No need to
>> >> >> >> monitor or tell anything.
>> >> >> >
>> >> >> > There are guests that won't like this (I know of one in-house, but
>> >> >> > others may even have more examples), specifically if you end up firing
>> >> >> > multiple IRQs in a row due to a longer backlog. For that reason, the RTC
>> >> >> > spreads the reinjection according to the current rate.
>> >> >>
>> >> >> Then reinject with a constant delay, or next CPU exit. Such buggy
>> >> > If guest's time frequency is the same as host time frequency you can't
>> >> > reinject with constant delay. That is why current code mixes two
>> >> > approaches: reinject M interrupts in a raw then delay.
>> >>
>> >> This approach can be also used by APIC-only version.
>> >>
>> > I don't know what APIC-only version you are talking about. I haven't
>> > seen the code and I don't understand hand waving, sorry.
>>
>> There is no code, because we're still at architecture design stage.
>>
> Try to write test code to understand the problem better.

I will.

>> >> >> guests could also be assisted with special handling (like win2k
>> >> >> install hack), for example guest instructions could be counted
>> >> >> (approximately, for example using TB size or TSC) and only inject
>> >> >> after at least N instructions have passed.
>> >> > Guest instructions cannot be easily counted in KVM (it can be done more
>> >> > or less reliably using perf counters, may be).
>> >>
>> >> Aren't there any debug registers or perf counters, which can generate
>> >> an interrupt after some number of instructions have been executed?
>> > Don't think debug registers have something like that and they are
>> > available for guest use anyway. Perf counters differs greatly from CPU
>> > to CPU (even between two CPUs of the same manufacturer), and we want to
>> > keep using them for profiling guests. And I don't see what problem it
>> > will solve anyway that can be solved by simple delay between irq
>> > reinjection.
>>
>> This would allow counting the executed instructions and limit it. Thus
>> we could emulate a 500MHz CPU on a 2GHz CPU more accurately.
>>
> Why would you want to limit number of instruction executed by guest if
> CPU has nothing else to do anyway? The problem occurs not when we have
> spare cycles so give to a guest, but in opposite case.

I think one problem is that the guest has executed too much compared
to what would happen with real HW with a lesser CPU. That explains the
RTC frequency reprogramming case.

>
>> >>
>> >> >>
>> >> >> > And even if the rate did not matter, the APIC woult still have to now
>> >> >> > about the fact that an IRQ is really periodic and does not only appear
>> >> >> > as such for a certain interval. This really does not sound like
>> >> >> > simplifying things or even make them cleaner.
>> >> >>
>> >> >> It would, the voodoo would be contained only in APIC, RTC would be
>> >> >> just like any other device. With the bidirectional irqs, this voodoo
>> >> >> would probably eventually spread to many other devices. The logical
>> >> >> conclusion of that would be a system where all devices would be
>> >> >> careful not to disturb the guest at wrong moment because that would
>> >> >> trigger a bug.
>> >> >>
>> >> > This voodoo will be so complex and unreliable that it will make RTC hack
>> >> > pale in comparison (and I still don't see how you are going to make it
>> >> > actually work).
>> >>
>> >> Implement everything inside APIC: only coalescing and reinjection.
>> > APIC has zero info needed to implement reinjection correctly as was
>> > shown to you several time in this thread and you simply keep ignoring
>> > it.
>>
>> On the contrary, APIC is actually the only source of the IRQ ack
>> information. RTC hack would not work without APIC (or the
>> bidirectional IRQ) passing this info to RTC.
>>
>> What APIC doesn't have now is the timer frequency or period info. This
>> is known by RTC and also higher levels managing the clocks.
>>
> So APIC has one bit of information and RTC everything else.

The information known by RTC (timer period) is also known by higher levels.

> The current
> approach (and proposed patch) brings this one bit of information to RTC,
> you are arguing that RTC should be able to communicate all its info to
> APIC. Sorry I don't see that your way has any advantage. Just more
> complex interface and it is much easier to get it wrong for other time
> sources.

I don't think anymore that APIC should be handling this but the
generic stuff, like vl.c or exec.c. Then there would be only
information passing from APIC to higher levels.

>> I keep ignoring the idea that the current model, where both RTC and
>> APIC must somehow work together to make coalescing work, is the only
>> possible just because it is committed and it happens to work in some
>> cases. It would be much better to concentrate this to one place, APIC
>> or preferably higher level where it may benefit other timers too.
>> Provided of course that the other models can be made to work.
>>
> So write the code and show us. You haven't show any evidence that RTC is
> the wrong place. RTC knows when interrupt was acknowledge to RTC, it
> know when clock frequency changes, it know when device reset happened.
> APIC knows only that interrupt was coalesced. It doesn't even know that
> it may be masked by a guest in IOAPIC (interrupts delivered while they
> are masked not considered coalesced).

Oh, I thought interrupt masking was the reason for coalescing! What
exactly is the reason then?

> Time source knows only when
> frequency changes and may be when device reset happens if timer is
> stopped by device on reset. So RTC is actually a sweet spot if you want
> to minimize amount of info you need to pass between various layers.
>
>> >> Maybe that version would not bend backwards as much as the current to
>> >> cater for buggy hosts.
>> >>
>> > You mean "buggy guests"?
>>
>> Yes, sorry.
>>
>> > What guests are not buggy in your opinion?
>> > Linux tries hard to be smart and as a result the only way to have stable
>> > clock with it is to go paravirt.
>>
>> I'm not an OS designer, but I think an OS should never crash, even if
>> a burst of IRQs is received. Reprogramming the timer should consider
>> the pending IRQ situation (0 or 1 with real HW). Those bugs are one
>> cause of the problem.
> OS should never crash in the absence of HW bugs? I doubt you can design
> an OS that can run in a face of any HW failure. Anyway here we are
> trying to solve guests time keeping problem not crashes. Do you think
> you can design OS that can keep time accurately no matter how crazy all
> HW clock behaves?

I think my OS design skills are not relevant in this discussion, but
IIRC there are fault tolerant operating systems for extreme conditions
so it can be done.

>
>>
>> >> > The fact is that timer device is not "just like any
>> >> > other device" in virtual world. Any other device is easy: you just
>> >> > implement spec as close as possible and everything works. For time
>> >> > source device this is not enough. You can implement RTC+HPET to the
>> >> > letter and your guest will drift like crazy.
>> >>
>> >> It's doable: a cycle accurate emulator will not cause any drift,
>> >> without any voodoo. The interrupts would come after executing the same
>> >> instruction as the real HW. For emulating any sufficiently buggy
>> >> guests in any sufficiently desperate low resource conditions, this may
>> >> be the only option that will always work.
>> >>
>> > Yes, but qemu and kvm are not cycle accurate emulators and don't strive
>> > to be one. On the contrary KVM runs at native host CPU speed most of the
>> > time, so any emulation done between two instruction is theoretically
>> > noticeable for a guest. TSC is bypassed directly to a guest too, so
>> > keeping all time source in perfect sync is also impossible.
>>
>> That is actually another cause of the problem. KVM gives the guest an
>> illusion that the VCPU speed is equal to host speed. When they don't
>> match, especially in critical code, there can be problems. It would be
>> better to tell the guest a lower speed, which also can be guaranteed.
>>
> Not possible. It's that simple. You should take it into account in your
> architecture design stage. In case of KVM real physical CPU executes guest
> instruction and it does this as fast as it can. The only way we can hide
> that from a guest is by intercepting each access to TSC and at that
> point we can use bochs instead.

Well, as Paul pointed out, there's also icount option.

>> Maybe we should also offline the device emulation to another host CPU
>> with threading. A load from a device will always be much slower than
>> on real HW though.
> Time drift problem start to happen on loaded servers, so you do not have
> spare CPU to offload device emulation too.
>
> --
>                        Gleb.
>
Blue Swirl - May 29, 2010, 9:35 a.m.
2010/5/29 Jan Kiszka <jan.kiszka@web.de>:
> Gleb Natapov wrote:
>> On Fri, May 28, 2010 at 08:06:45PM +0000, Blue Swirl wrote:
>>> 2010/5/28 Gleb Natapov <gleb@redhat.com>:
>>>> On Thu, May 27, 2010 at 06:37:10PM +0000, Blue Swirl wrote:
>>>>> 2010/5/27 Gleb Natapov <gleb@redhat.com>:
>>>>>> On Wed, May 26, 2010 at 08:35:00PM +0000, Blue Swirl wrote:
>>>>>>> On Wed, May 26, 2010 at 8:09 PM, Jan Kiszka <jan.kiszka@web.de> wrote:
>>>>>>>> Blue Swirl wrote:
>>>>>>>>> On Tue, May 25, 2010 at 9:44 PM, Jan Kiszka <jan.kiszka@web.de> wrote:
>>>>>>>>>> Anthony Liguori wrote:
>>>>>>>>>>> On 05/25/2010 02:09 PM, Blue Swirl wrote:
>>>>>>>>>>>> On Mon, May 24, 2010 at 8:13 PM, Jan Kiszka<jan.kiszka@web.de>  wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> From: Jan Kiszka<jan.kiszka@siemens.com>
>>>>>>>>>>>>>
>>>>>>>>>>>>> This allows to communicate potential IRQ coalescing during delivery from
>>>>>>>>>>>>> the sink back to the source. Targets that support IRQ coalescing
>>>>>>>>>>>>> workarounds need to register handlers that return the appropriate
>>>>>>>>>>>>> QEMU_IRQ_* code, and they have to propergate the code across all IRQ
>>>>>>>>>>>>> redirections. If the IRQ source receives a QEMU_IRQ_COALESCED, it can
>>>>>>>>>>>>> apply its workaround. If multiple sinks exist, the source may only
>>>>>>>>>>>>> consider an IRQ coalesced if all other sinks either report
>>>>>>>>>>>>> QEMU_IRQ_COALESCED as well or QEMU_IRQ_MASKED.
>>>>>>>>>>>>>
>>>>>>>>>>>> No real devices are interested whether any of their output lines are
>>>>>>>>>>>> even connected. This would introduce a new signal type, bidirectional
>>>>>>>>>>>> multi-level, which is not correct.
>>>>>>>>>>>>
>>>>>>>>>>> I don't think it's really an issue of correct, but I wouldn't disagree
>>>>>>>>>>> to a suggestion that we ought to introduce a new signal type for this
>>>>>>>>>>> type of bidirectional feedback.  Maybe it's qemu_coalesced_irq and has a
>>>>>>>>>>> similar interface as qemu_irq.
>>>>>>>>>> A separate type would complicate the delivery of the feedback value
>>>>>>>>>> across GPIO pins (as Paul requested for the RTC->HPET routing).
>>>>>>>>>>
>>>>>>>>>>>> I think the real solution to coalescing is put the logic inside one
>>>>>>>>>>>> device, in this case APIC because it has the information about irq
>>>>>>>>>>>> delivery. APIC could monitor incoming RTC irqs for frequency
>>>>>>>>>>>> information and whether they get delivered or not. If not, an internal
>>>>>>>>>>>> timer is installed which injects the lost irqs.
>>>>>>>>>> That won't fly as the IRQs will already arrive at the APIC with a
>>>>>>>>>> sufficiently high jitter. At the bare minimum, you need to tell the
>>>>>>>>>> interrupt controller about the fact that a particular IRQ should be
>>>>>>>>>> delivered at a specific regular rate. For this, you also need a generic
>>>>>>>>>> interface - nothing really "won".
>>>>>>>>> OK, let's simplify: just reinject at next possible chance. No need to
>>>>>>>>> monitor or tell anything.
>>>>>>>> There are guests that won't like this (I know of one in-house, but
>>>>>>>> others may even have more examples), specifically if you end up firing
>>>>>>>> multiple IRQs in a row due to a longer backlog. For that reason, the RTC
>>>>>>>> spreads the reinjection according to the current rate.
>>>>>>> Then reinject with a constant delay, or next CPU exit. Such buggy
>>>>>> If guest's time frequency is the same as host time frequency you can't
>>>>>> reinject with constant delay. That is why current code mixes two
>>>>>> approaches: reinject M interrupts in a raw then delay.
>>>>> This approach can be also used by APIC-only version.
>>>>>
>>>> I don't know what APIC-only version you are talking about. I haven't
>>>> seen the code and I don't understand hand waving, sorry.
>>> There is no code, because we're still at architecture design stage.
>>>
>> Try to write test code to understand the problem better.
>>
>>>>>>> guests could also be assisted with special handling (like win2k
>>>>>>> install hack), for example guest instructions could be counted
>>>>>>> (approximately, for example using TB size or TSC) and only inject
>>>>>>> after at least N instructions have passed.
>>>>>> Guest instructions cannot be easily counted in KVM (it can be done more
>>>>>> or less reliably using perf counters, may be).
>>>>> Aren't there any debug registers or perf counters, which can generate
>>>>> an interrupt after some number of instructions have been executed?
>>>> Don't think debug registers have something like that and they are
>>>> available for guest use anyway. Perf counters differs greatly from CPU
>>>> to CPU (even between two CPUs of the same manufacturer), and we want to
>>>> keep using them for profiling guests. And I don't see what problem it
>>>> will solve anyway that can be solved by simple delay between irq
>>>> reinjection.
>>> This would allow counting the executed instructions and limit it. Thus
>>> we could emulate a 500MHz CPU on a 2GHz CPU more accurately.
>>>
>> Why would you want to limit number of instruction executed by guest if
>> CPU has nothing else to do anyway? The problem occurs not when we have
>> spare cycles so give to a guest, but in opposite case.
>>
>>>>>>>> And even if the rate did not matter, the APIC woult still have to now
>>>>>>>> about the fact that an IRQ is really periodic and does not only appear
>>>>>>>> as such for a certain interval. This really does not sound like
>>>>>>>> simplifying things or even make them cleaner.
>>>>>>> It would, the voodoo would be contained only in APIC, RTC would be
>>>>>>> just like any other device. With the bidirectional irqs, this voodoo
>>>>>>> would probably eventually spread to many other devices. The logical
>>>>>>> conclusion of that would be a system where all devices would be
>>>>>>> careful not to disturb the guest at wrong moment because that would
>>>>>>> trigger a bug.
>>>>>>>
>>>>>> This voodoo will be so complex and unreliable that it will make RTC hack
>>>>>> pale in comparison (and I still don't see how you are going to make it
>>>>>> actually work).
>>>>> Implement everything inside APIC: only coalescing and reinjection.
>>>> APIC has zero info needed to implement reinjection correctly as was
>>>> shown to you several time in this thread and you simply keep ignoring
>>>> it.
>>> On the contrary, APIC is actually the only source of the IRQ ack
>>> information. RTC hack would not work without APIC (or the
>>> bidirectional IRQ) passing this info to RTC.
>>>
>>> What APIC doesn't have now is the timer frequency or period info. This
>>> is known by RTC and also higher levels managing the clocks.
>>>
>> So APIC has one bit of information and RTC everything else. The current
>> approach (and proposed patch) brings this one bit of information to RTC,
>> you are arguing that RTC should be able to communicate all its info to
>> APIC. Sorry I don't see that your way has any advantage. Just more
>> complex interface and it is much easier to get it wrong for other time
>> sources.
>>
>>> I keep ignoring the idea that the current model, where both RTC and
>>> APIC must somehow work together to make coalescing work, is the only
>>> possible just because it is committed and it happens to work in some
>>> cases. It would be much better to concentrate this to one place, APIC
>>> or preferably higher level where it may benefit other timers too.
>>> Provided of course that the other models can be made to work.
>>>
>> So write the code and show us. You haven't show any evidence that RTC is
>> the wrong place. RTC knows when interrupt was acknowledge to RTC, it
>> know when clock frequency changes, it know when device reset happened.
>> APIC knows only that interrupt was coalesced. It doesn't even know that
>> it may be masked by a guest in IOAPIC (interrupts delivered while they
>> are masked not considered coalesced). Time source knows only when
>> frequency changes and may be when device reset happens if timer is
>> stopped by device on reset. So RTC is actually a sweet spot if you want
>> to minimize amount of info you need to pass between various layers.
>
> This is - according to my current understanding - the proposed
> alternative architecture:
>
>                                          .---------------.
>                                          | de-coalescing |
>                                          |     logic     |
>                                          '---------------'
>                                                ^   |
>                                         period,|   |IRQ
>                                       coalesced|   |(or tuned VM clock)
>                                        (yes/no)|   v
> .-------.              .--------.             .-------.
> |  RTC  |-----IRQ----->| router |-----IRQ---->| APIC  |
> '-------'              '--------'             '-------'
>    |                    ^    |                   ^
>    |                    |    |                   |
>    '-------period-------'    '------period-------'

The period information is already known by the higher level clock
management, it should be available (or made available) to
de-coalescing logic (which should probably be located close to other
clock stuff) but otherwise there shouldn't be a need to pass it
around. The tuned VM clock of course only affects the RTC. Otherwise
correct.

> And here is what this patch implements (except for the not yet factored
> out de-coalescing logic):
>
>   .---------------.
>   | de-coalescing |
>   |     logic     |
>   '---------------'
>         ^   |
>  period,|   |next IRQ date
> coalesced|   |(or tuned VM clock)
>  (yes/no)|   v
>       .-------.              .--------.             .-------.
>       |  RTC  |-----IRQ----->| router |-----IRQ---->| APIC  |
>       '-------'              '--------'             '-------'
>           ^                    |    ^                   |
>           |                    |    |                   |
>           '------coalesced-----'    '-----coalesced-----'

Why not route the coalesced signal directly from APIC to de-coalescing
logic? Then our designs would match!

>
> I still don't see how the alternative is supposed to simplify our life
> or improve the efficiency of the de-coalescing workaround. It's rather
> problematic like Gleb pointed out: The de-coalescing logic needs to be
> informed about periodicity changes that can only be delivered along
> IRQs. So what to do with the backlog when the timer is stopped?

What happens with the current design? Gleb only mentioned the
frequency change, I thought that was not so big problem. But I don't
think this case should be allowed happen at all, it can't exist on
real HW.

> Regarding an improved de-coalescing logic: As its final design is widely
> independent of how we collect the information, it could perfectly be
> done after we laid the elementary foundation.

True. But what is the foundation, do we need bidirectional IRQs or
not? I still think current IRQs should be enough.
Jan Kiszka - May 29, 2010, 9:36 a.m.
Blue Swirl wrote:
>>> On the contrary, APIC is actually the only source of the IRQ ack
>>> information. RTC hack would not work without APIC (or the
>>> bidirectional IRQ) passing this info to RTC.
>>>
>>> What APIC doesn't have now is the timer frequency or period info. This
>>> is known by RTC and also higher levels managing the clocks.
>>>
>> So APIC has one bit of information and RTC everything else.
> 
> The information known by RTC (timer period) is also known by higher levels.

Curious to see where you'll find this.

> 
>> The current
>> approach (and proposed patch) brings this one bit of information to RTC,
>> you are arguing that RTC should be able to communicate all its info to
>> APIC. Sorry I don't see that your way has any advantage. Just more
>> complex interface and it is much easier to get it wrong for other time
>> sources.
> 
> I don't think anymore that APIC should be handling this but the
> generic stuff, like vl.c or exec.c. Then there would be only
> information passing from APIC to higher levels.

You neglect the the information required to associate a periodic source
(e.g. RTC) with an IRQ sink (e.g. APIC). Without that, you will have a
hard time figuring out if a reported IRQ coalescing requires any
activities or should simply be welcomed (for I/O IRQs).

> 
>>> I keep ignoring the idea that the current model, where both RTC and
>>> APIC must somehow work together to make coalescing work, is the only
>>> possible just because it is committed and it happens to work in some
>>> cases. It would be much better to concentrate this to one place, APIC
>>> or preferably higher level where it may benefit other timers too.
>>> Provided of course that the other models can be made to work.
>>>
>> So write the code and show us. You haven't show any evidence that RTC is
>> the wrong place. RTC knows when interrupt was acknowledge to RTC, it
>> know when clock frequency changes, it know when device reset happened.
>> APIC knows only that interrupt was coalesced. It doesn't even know that
>> it may be masked by a guest in IOAPIC (interrupts delivered while they
>> are masked not considered coalesced).
> 
> Oh, I thought interrupt masking was the reason for coalescing! What
> exactly is the reason then?

Missing acks, ie. the IRQ is still pending when the next one arrives.
You want to filter out masked/suppressed IRQs to avoid running the
de-coalescing logic on sources that are actually cut off (like the RTC
IRQ when the HPET took over).

> 
>> Time source knows only when
>> frequency changes and may be when device reset happens if timer is
>> stopped by device on reset. So RTC is actually a sweet spot if you want
>> to minimize amount of info you need to pass between various layers.
>>
>>>>> Maybe that version would not bend backwards as much as the current to
>>>>> cater for buggy hosts.
>>>>>
>>>> You mean "buggy guests"?
>>> Yes, sorry.
>>>
>>>> What guests are not buggy in your opinion?
>>>> Linux tries hard to be smart and as a result the only way to have stable
>>>> clock with it is to go paravirt.
>>> I'm not an OS designer, but I think an OS should never crash, even if
>>> a burst of IRQs is received. Reprogramming the timer should consider
>>> the pending IRQ situation (0 or 1 with real HW). Those bugs are one
>>> cause of the problem.
>> OS should never crash in the absence of HW bugs? I doubt you can design
>> an OS that can run in a face of any HW failure. Anyway here we are
>> trying to solve guests time keeping problem not crashes. Do you think
>> you can design OS that can keep time accurately no matter how crazy all
>> HW clock behaves?
> 
> I think my OS design skills are not relevant in this discussion, but
> IIRC there are fault tolerant operating systems for extreme conditions
> so it can be done.

No one can influence the design of released OS versions anymore.

> 
>>>>>> The fact is that timer device is not "just like any
>>>>>> other device" in virtual world. Any other device is easy: you just
>>>>>> implement spec as close as possible and everything works. For time
>>>>>> source device this is not enough. You can implement RTC+HPET to the
>>>>>> letter and your guest will drift like crazy.
>>>>> It's doable: a cycle accurate emulator will not cause any drift,
>>>>> without any voodoo. The interrupts would come after executing the same
>>>>> instruction as the real HW. For emulating any sufficiently buggy
>>>>> guests in any sufficiently desperate low resource conditions, this may
>>>>> be the only option that will always work.
>>>>>
>>>> Yes, but qemu and kvm are not cycle accurate emulators and don't strive
>>>> to be one. On the contrary KVM runs at native host CPU speed most of the
>>>> time, so any emulation done between two instruction is theoretically
>>>> noticeable for a guest. TSC is bypassed directly to a guest too, so
>>>> keeping all time source in perfect sync is also impossible.
>>> That is actually another cause of the problem. KVM gives the guest an
>>> illusion that the VCPU speed is equal to host speed. When they don't
>>> match, especially in critical code, there can be problems. It would be
>>> better to tell the guest a lower speed, which also can be guaranteed.
>>>
>> Not possible. It's that simple. You should take it into account in your
>> architecture design stage. In case of KVM real physical CPU executes guest
>> instruction and it does this as fast as it can. The only way we can hide
>> that from a guest is by intercepting each access to TSC and at that
>> point we can use bochs instead.
> 
> Well, as Paul pointed out, there's also icount option.

Which is not available in virtualization mode.

Jan
Jan Kiszka - May 29, 2010, 9:45 a.m.
Blue Swirl wrote:
> 2010/5/29 Jan Kiszka <jan.kiszka@web.de>:
>> Gleb Natapov wrote:
>>> On Fri, May 28, 2010 at 08:06:45PM +0000, Blue Swirl wrote:
>>>> 2010/5/28 Gleb Natapov <gleb@redhat.com>:
>>>>> On Thu, May 27, 2010 at 06:37:10PM +0000, Blue Swirl wrote:
>>>>>> 2010/5/27 Gleb Natapov <gleb@redhat.com>:
>>>>>>> On Wed, May 26, 2010 at 08:35:00PM +0000, Blue Swirl wrote:
>>>>>>>> On Wed, May 26, 2010 at 8:09 PM, Jan Kiszka <jan.kiszka@web.de> wrote:
>>>>>>>>> Blue Swirl wrote:
>>>>>>>>>> On Tue, May 25, 2010 at 9:44 PM, Jan Kiszka <jan.kiszka@web.de> wrote:
>>>>>>>>>>> Anthony Liguori wrote:
>>>>>>>>>>>> On 05/25/2010 02:09 PM, Blue Swirl wrote:
>>>>>>>>>>>>> On Mon, May 24, 2010 at 8:13 PM, Jan Kiszka<jan.kiszka@web.de>  wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> From: Jan Kiszka<jan.kiszka@siemens.com>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This allows to communicate potential IRQ coalescing during delivery from
>>>>>>>>>>>>>> the sink back to the source. Targets that support IRQ coalescing
>>>>>>>>>>>>>> workarounds need to register handlers that return the appropriate
>>>>>>>>>>>>>> QEMU_IRQ_* code, and they have to propergate the code across all IRQ
>>>>>>>>>>>>>> redirections. If the IRQ source receives a QEMU_IRQ_COALESCED, it can
>>>>>>>>>>>>>> apply its workaround. If multiple sinks exist, the source may only
>>>>>>>>>>>>>> consider an IRQ coalesced if all other sinks either report
>>>>>>>>>>>>>> QEMU_IRQ_COALESCED as well or QEMU_IRQ_MASKED.
>>>>>>>>>>>>>>
>>>>>>>>>>>>> No real devices are interested whether any of their output lines are
>>>>>>>>>>>>> even connected. This would introduce a new signal type, bidirectional
>>>>>>>>>>>>> multi-level, which is not correct.
>>>>>>>>>>>>>
>>>>>>>>>>>> I don't think it's really an issue of correct, but I wouldn't disagree
>>>>>>>>>>>> to a suggestion that we ought to introduce a new signal type for this
>>>>>>>>>>>> type of bidirectional feedback.  Maybe it's qemu_coalesced_irq and has a
>>>>>>>>>>>> similar interface as qemu_irq.
>>>>>>>>>>> A separate type would complicate the delivery of the feedback value
>>>>>>>>>>> across GPIO pins (as Paul requested for the RTC->HPET routing).
>>>>>>>>>>>
>>>>>>>>>>>>> I think the real solution to coalescing is put the logic inside one
>>>>>>>>>>>>> device, in this case APIC because it has the information about irq
>>>>>>>>>>>>> delivery. APIC could monitor incoming RTC irqs for frequency
>>>>>>>>>>>>> information and whether they get delivered or not. If not, an internal
>>>>>>>>>>>>> timer is installed which injects the lost irqs.
>>>>>>>>>>> That won't fly as the IRQs will already arrive at the APIC with a
>>>>>>>>>>> sufficiently high jitter. At the bare minimum, you need to tell the
>>>>>>>>>>> interrupt controller about the fact that a particular IRQ should be
>>>>>>>>>>> delivered at a specific regular rate. For this, you also need a generic
>>>>>>>>>>> interface - nothing really "won".
>>>>>>>>>> OK, let's simplify: just reinject at next possible chance. No need to
>>>>>>>>>> monitor or tell anything.
>>>>>>>>> There are guests that won't like this (I know of one in-house, but
>>>>>>>>> others may even have more examples), specifically if you end up firing
>>>>>>>>> multiple IRQs in a row due to a longer backlog. For that reason, the RTC
>>>>>>>>> spreads the reinjection according to the current rate.
>>>>>>>> Then reinject with a constant delay, or next CPU exit. Such buggy
>>>>>>> If guest's time frequency is the same as host time frequency you can't
>>>>>>> reinject with constant delay. That is why current code mixes two
>>>>>>> approaches: reinject M interrupts in a raw then delay.
>>>>>> This approach can be also used by APIC-only version.
>>>>>>
>>>>> I don't know what APIC-only version you are talking about. I haven't
>>>>> seen the code and I don't understand hand waving, sorry.
>>>> There is no code, because we're still at architecture design stage.
>>>>
>>> Try to write test code to understand the problem better.
>>>
>>>>>>>> guests could also be assisted with special handling (like win2k
>>>>>>>> install hack), for example guest instructions could be counted
>>>>>>>> (approximately, for example using TB size or TSC) and only inject
>>>>>>>> after at least N instructions have passed.
>>>>>>> Guest instructions cannot be easily counted in KVM (it can be done more
>>>>>>> or less reliably using perf counters, may be).
>>>>>> Aren't there any debug registers or perf counters, which can generate
>>>>>> an interrupt after some number of instructions have been executed?
>>>>> Don't think debug registers have something like that and they are
>>>>> available for guest use anyway. Perf counters differs greatly from CPU
>>>>> to CPU (even between two CPUs of the same manufacturer), and we want to
>>>>> keep using them for profiling guests. And I don't see what problem it
>>>>> will solve anyway that can be solved by simple delay between irq
>>>>> reinjection.
>>>> This would allow counting the executed instructions and limit it. Thus
>>>> we could emulate a 500MHz CPU on a 2GHz CPU more accurately.
>>>>
>>> Why would you want to limit number of instruction executed by guest if
>>> CPU has nothing else to do anyway? The problem occurs not when we have
>>> spare cycles so give to a guest, but in opposite case.
>>>
>>>>>>>>> And even if the rate did not matter, the APIC woult still have to now
>>>>>>>>> about the fact that an IRQ is really periodic and does not only appear
>>>>>>>>> as such for a certain interval. This really does not sound like
>>>>>>>>> simplifying things or even make them cleaner.
>>>>>>>> It would, the voodoo would be contained only in APIC, RTC would be
>>>>>>>> just like any other device. With the bidirectional irqs, this voodoo
>>>>>>>> would probably eventually spread to many other devices. The logical
>>>>>>>> conclusion of that would be a system where all devices would be
>>>>>>>> careful not to disturb the guest at wrong moment because that would
>>>>>>>> trigger a bug.
>>>>>>>>
>>>>>>> This voodoo will be so complex and unreliable that it will make RTC hack
>>>>>>> pale in comparison (and I still don't see how you are going to make it
>>>>>>> actually work).
>>>>>> Implement everything inside APIC: only coalescing and reinjection.
>>>>> APIC has zero info needed to implement reinjection correctly as was
>>>>> shown to you several time in this thread and you simply keep ignoring
>>>>> it.
>>>> On the contrary, APIC is actually the only source of the IRQ ack
>>>> information. RTC hack would not work without APIC (or the
>>>> bidirectional IRQ) passing this info to RTC.
>>>>
>>>> What APIC doesn't have now is the timer frequency or period info. This
>>>> is known by RTC and also higher levels managing the clocks.
>>>>
>>> So APIC has one bit of information and RTC everything else. The current
>>> approach (and proposed patch) brings this one bit of information to RTC,
>>> you are arguing that RTC should be able to communicate all its info to
>>> APIC. Sorry I don't see that your way has any advantage. Just more
>>> complex interface and it is much easier to get it wrong for other time
>>> sources.
>>>
>>>> I keep ignoring the idea that the current model, where both RTC and
>>>> APIC must somehow work together to make coalescing work, is the only
>>>> possible just because it is committed and it happens to work in some
>>>> cases. It would be much better to concentrate this to one place, APIC
>>>> or preferably higher level where it may benefit other timers too.
>>>> Provided of course that the other models can be made to work.
>>>>
>>> So write the code and show us. You haven't show any evidence that RTC is
>>> the wrong place. RTC knows when interrupt was acknowledge to RTC, it
>>> know when clock frequency changes, it know when device reset happened.
>>> APIC knows only that interrupt was coalesced. It doesn't even know that
>>> it may be masked by a guest in IOAPIC (interrupts delivered while they
>>> are masked not considered coalesced). Time source knows only when
>>> frequency changes and may be when device reset happens if timer is
>>> stopped by device on reset. So RTC is actually a sweet spot if you want
>>> to minimize amount of info you need to pass between various layers.
>> This is - according to my current understanding - the proposed
>> alternative architecture:
>>
>>                                          .---------------.
>>                                          | de-coalescing |
>>                                          |     logic     |
>>                                          '---------------'
>>                                                ^   |
>>                                         period,|   |IRQ
>>                                       coalesced|   |(or tuned VM clock)
>>                                        (yes/no)|   v
>> .-------.              .--------.             .-------.
>> |  RTC  |-----IRQ----->| router |-----IRQ---->| APIC  |
>> '-------'              '--------'             '-------'
>>    |                    ^    |                   ^
>>    |                    |    |                   |
>>    '-------period-------'    '------period-------'
> 
> The period information is already known by the higher level clock
> management,

The timer device models program the next event of some qemu-timer. There
is no tag attached explaining the timer subsystem or anyone else the
reason behind this programming.

> it should be available (or made available) to
> de-coalescing logic (which should probably be located close to other
> clock stuff) but otherwise there shouldn't be a need to pass it
> around. The tuned VM clock of course only affects the RTC. Otherwise
> correct.
> 
>> And here is what this patch implements (except for the not yet factored
>> out de-coalescing logic):
>>
>>   .---------------.
>>   | de-coalescing |
>>   |     logic     |
>>   '---------------'
>>         ^   |
>>  period,|   |next IRQ date
>> coalesced|   |(or tuned VM clock)
>>  (yes/no)|   v
>>       .-------.              .--------.             .-------.
>>       |  RTC  |-----IRQ----->| router |-----IRQ---->| APIC  |
>>       '-------'              '--------'             '-------'
>>           ^                    |    ^                   |
>>           |                    |    |                   |
>>           '------coalesced-----'    '-----coalesced-----'
> 
> Why not route the coalesced signal directly from APIC to de-coalescing
> logic? Then our designs would match!

Because the de-coalescing logic will not be able to associate it with
the period information delivered by the RTC (without apply ugly hacks).

> 
>> I still don't see how the alternative is supposed to simplify our life
>> or improve the efficiency of the de-coalescing workaround. It's rather
>> problematic like Gleb pointed out: The de-coalescing logic needs to be
>> informed about periodicity changes that can only be delivered along
>> IRQs. So what to do with the backlog when the timer is stopped?
> 
> What happens with the current design? Gleb only mentioned the
> frequency change, I thought that was not so big problem. But I don't
> think this case should be allowed happen at all, it can't exist on
> real HW.
> 
>> Regarding an improved de-coalescing logic: As its final design is widely
>> independent of how we collect the information, it could perfectly be
>> done after we laid the elementary foundation.
> 
> True. But what is the foundation, do we need bidirectional IRQs or
> not? I still think current IRQs should be enough.

Then explain how you want to do the association between IRQ injection
and IRQ period. Both of my diagrams require an additional channel
between source and sink, either forward (tag IRQ event with its period)
or backward (report injection result - aka feedback IRQs). We currently
don't have such channel but the APIC hack (apic_reset/get_irq_delivered).

Jan
Blue Swirl - May 29, 2010, 10:04 a.m.
On Sat, May 29, 2010 at 9:45 AM, Jan Kiszka <jan.kiszka@web.de> wrote:
> Blue Swirl wrote:
>> 2010/5/29 Jan Kiszka <jan.kiszka@web.de>:
>>> Gleb Natapov wrote:
>>>> On Fri, May 28, 2010 at 08:06:45PM +0000, Blue Swirl wrote:
>>>>> 2010/5/28 Gleb Natapov <gleb@redhat.com>:
>>>>>> On Thu, May 27, 2010 at 06:37:10PM +0000, Blue Swirl wrote:
>>>>>>> 2010/5/27 Gleb Natapov <gleb@redhat.com>:
>>>>>>>> On Wed, May 26, 2010 at 08:35:00PM +0000, Blue Swirl wrote:
>>>>>>>>> On Wed, May 26, 2010 at 8:09 PM, Jan Kiszka <jan.kiszka@web.de> wrote:
>>>>>>>>>> Blue Swirl wrote:
>>>>>>>>>>> On Tue, May 25, 2010 at 9:44 PM, Jan Kiszka <jan.kiszka@web.de> wrote:
>>>>>>>>>>>> Anthony Liguori wrote:
>>>>>>>>>>>>> On 05/25/2010 02:09 PM, Blue Swirl wrote:
>>>>>>>>>>>>>> On Mon, May 24, 2010 at 8:13 PM, Jan Kiszka<jan.kiszka@web.de>  wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> From: Jan Kiszka<jan.kiszka@siemens.com>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> This allows to communicate potential IRQ coalescing during delivery from
>>>>>>>>>>>>>>> the sink back to the source. Targets that support IRQ coalescing
>>>>>>>>>>>>>>> workarounds need to register handlers that return the appropriate
>>>>>>>>>>>>>>> QEMU_IRQ_* code, and they have to propergate the code across all IRQ
>>>>>>>>>>>>>>> redirections. If the IRQ source receives a QEMU_IRQ_COALESCED, it can
>>>>>>>>>>>>>>> apply its workaround. If multiple sinks exist, the source may only
>>>>>>>>>>>>>>> consider an IRQ coalesced if all other sinks either report
>>>>>>>>>>>>>>> QEMU_IRQ_COALESCED as well or QEMU_IRQ_MASKED.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> No real devices are interested whether any of their output lines are
>>>>>>>>>>>>>> even connected. This would introduce a new signal type, bidirectional
>>>>>>>>>>>>>> multi-level, which is not correct.
>>>>>>>>>>>>>>
>>>>>>>>>>>>> I don't think it's really an issue of correct, but I wouldn't disagree
>>>>>>>>>>>>> to a suggestion that we ought to introduce a new signal type for this
>>>>>>>>>>>>> type of bidirectional feedback.  Maybe it's qemu_coalesced_irq and has a
>>>>>>>>>>>>> similar interface as qemu_irq.
>>>>>>>>>>>> A separate type would complicate the delivery of the feedback value
>>>>>>>>>>>> across GPIO pins (as Paul requested for the RTC->HPET routing).
>>>>>>>>>>>>
>>>>>>>>>>>>>> I think the real solution to coalescing is put the logic inside one
>>>>>>>>>>>>>> device, in this case APIC because it has the information about irq
>>>>>>>>>>>>>> delivery. APIC could monitor incoming RTC irqs for frequency
>>>>>>>>>>>>>> information and whether they get delivered or not. If not, an internal
>>>>>>>>>>>>>> timer is installed which injects the lost irqs.
>>>>>>>>>>>> That won't fly as the IRQs will already arrive at the APIC with a
>>>>>>>>>>>> sufficiently high jitter. At the bare minimum, you need to tell the
>>>>>>>>>>>> interrupt controller about the fact that a particular IRQ should be
>>>>>>>>>>>> delivered at a specific regular rate. For this, you also need a generic
>>>>>>>>>>>> interface - nothing really "won".
>>>>>>>>>>> OK, let's simplify: just reinject at next possible chance. No need to
>>>>>>>>>>> monitor or tell anything.
>>>>>>>>>> There are guests that won't like this (I know of one in-house, but
>>>>>>>>>> others may even have more examples), specifically if you end up firing
>>>>>>>>>> multiple IRQs in a row due to a longer backlog. For that reason, the RTC
>>>>>>>>>> spreads the reinjection according to the current rate.
>>>>>>>>> Then reinject with a constant delay, or next CPU exit. Such buggy
>>>>>>>> If guest's time frequency is the same as host time frequency you can't
>>>>>>>> reinject with constant delay. That is why current code mixes two
>>>>>>>> approaches: reinject M interrupts in a raw then delay.
>>>>>>> This approach can be also used by APIC-only version.
>>>>>>>
>>>>>> I don't know what APIC-only version you are talking about. I haven't
>>>>>> seen the code and I don't understand hand waving, sorry.
>>>>> There is no code, because we're still at architecture design stage.
>>>>>
>>>> Try to write test code to understand the problem better.
>>>>
>>>>>>>>> guests could also be assisted with special handling (like win2k
>>>>>>>>> install hack), for example guest instructions could be counted
>>>>>>>>> (approximately, for example using TB size or TSC) and only inject
>>>>>>>>> after at least N instructions have passed.
>>>>>>>> Guest instructions cannot be easily counted in KVM (it can be done more
>>>>>>>> or less reliably using perf counters, may be).
>>>>>>> Aren't there any debug registers or perf counters, which can generate
>>>>>>> an interrupt after some number of instructions have been executed?
>>>>>> Don't think debug registers have something like that and they are
>>>>>> available for guest use anyway. Perf counters differs greatly from CPU
>>>>>> to CPU (even between two CPUs of the same manufacturer), and we want to
>>>>>> keep using them for profiling guests. And I don't see what problem it
>>>>>> will solve anyway that can be solved by simple delay between irq
>>>>>> reinjection.
>>>>> This would allow counting the executed instructions and limit it. Thus
>>>>> we could emulate a 500MHz CPU on a 2GHz CPU more accurately.
>>>>>
>>>> Why would you want to limit number of instruction executed by guest if
>>>> CPU has nothing else to do anyway? The problem occurs not when we have
>>>> spare cycles so give to a guest, but in opposite case.
>>>>
>>>>>>>>>> And even if the rate did not matter, the APIC woult still have to now
>>>>>>>>>> about the fact that an IRQ is really periodic and does not only appear
>>>>>>>>>> as such for a certain interval. This really does not sound like
>>>>>>>>>> simplifying things or even make them cleaner.
>>>>>>>>> It would, the voodoo would be contained only in APIC, RTC would be
>>>>>>>>> just like any other device. With the bidirectional irqs, this voodoo
>>>>>>>>> would probably eventually spread to many other devices. The logical
>>>>>>>>> conclusion of that would be a system where all devices would be
>>>>>>>>> careful not to disturb the guest at wrong moment because that would
>>>>>>>>> trigger a bug.
>>>>>>>>>
>>>>>>>> This voodoo will be so complex and unreliable that it will make RTC hack
>>>>>>>> pale in comparison (and I still don't see how you are going to make it
>>>>>>>> actually work).
>>>>>>> Implement everything inside APIC: only coalescing and reinjection.
>>>>>> APIC has zero info needed to implement reinjection correctly as was
>>>>>> shown to you several time in this thread and you simply keep ignoring
>>>>>> it.
>>>>> On the contrary, APIC is actually the only source of the IRQ ack
>>>>> information. RTC hack would not work without APIC (or the
>>>>> bidirectional IRQ) passing this info to RTC.
>>>>>
>>>>> What APIC doesn't have now is the timer frequency or period info. This
>>>>> is known by RTC and also higher levels managing the clocks.
>>>>>
>>>> So APIC has one bit of information and RTC everything else. The current
>>>> approach (and proposed patch) brings this one bit of information to RTC,
>>>> you are arguing that RTC should be able to communicate all its info to
>>>> APIC. Sorry I don't see that your way has any advantage. Just more
>>>> complex interface and it is much easier to get it wrong for other time
>>>> sources.
>>>>
>>>>> I keep ignoring the idea that the current model, where both RTC and
>>>>> APIC must somehow work together to make coalescing work, is the only
>>>>> possible just because it is committed and it happens to work in some
>>>>> cases. It would be much better to concentrate this to one place, APIC
>>>>> or preferably higher level where it may benefit other timers too.
>>>>> Provided of course that the other models can be made to work.
>>>>>
>>>> So write the code and show us. You haven't show any evidence that RTC is
>>>> the wrong place. RTC knows when interrupt was acknowledge to RTC, it
>>>> know when clock frequency changes, it know when device reset happened.
>>>> APIC knows only that interrupt was coalesced. It doesn't even know that
>>>> it may be masked by a guest in IOAPIC (interrupts delivered while they
>>>> are masked not considered coalesced). Time source knows only when
>>>> frequency changes and may be when device reset happens if timer is
>>>> stopped by device on reset. So RTC is actually a sweet spot if you want
>>>> to minimize amount of info you need to pass between various layers.
>>> This is - according to my current understanding - the proposed
>>> alternative architecture:
>>>
>>>                                          .---------------.
>>>                                          | de-coalescing |
>>>                                          |     logic     |
>>>                                          '---------------'
>>>                                                ^   |
>>>                                         period,|   |IRQ
>>>                                       coalesced|   |(or tuned VM clock)
>>>                                        (yes/no)|   v
>>> .-------.              .--------.             .-------.
>>> |  RTC  |-----IRQ----->| router |-----IRQ---->| APIC  |
>>> '-------'              '--------'             '-------'
>>>    |                    ^    |                   ^
>>>    |                    |    |                   |
>>>    '-------period-------'    '------period-------'
>>
>> The period information is already known by the higher level clock
>> management,
>
> The timer device models program the next event of some qemu-timer. There
> is no tag attached explaining the timer subsystem or anyone else the
> reason behind this programming.

Yes, but why would we care? All timers in the system besides RTC
should be affected likewise, this would in fact be the benefit from
handling the problem at higher level.

>> it should be available (or made available) to
>> de-coalescing logic (which should probably be located close to other
>> clock stuff) but otherwise there shouldn't be a need to pass it
>> around. The tuned VM clock of course only affects the RTC. Otherwise
>> correct.
>>
>>> And here is what this patch implements (except for the not yet factored
>>> out de-coalescing logic):
>>>
>>>   .---------------.
>>>   | de-coalescing |
>>>   |     logic     |
>>>   '---------------'
>>>         ^   |
>>>  period,|   |next IRQ date
>>> coalesced|   |(or tuned VM clock)
>>>  (yes/no)|   v
>>>       .-------.              .--------.             .-------.
>>>       |  RTC  |-----IRQ----->| router |-----IRQ---->| APIC  |
>>>       '-------'              '--------'             '-------'
>>>           ^                    |    ^                   |
>>>           |                    |    |                   |
>>>           '------coalesced-----'    '-----coalesced-----'
>>
>> Why not route the coalesced signal directly from APIC to de-coalescing
>> logic? Then our designs would match!
>
> Because the de-coalescing logic will not be able to associate it with
> the period information delivered by the RTC (without apply ugly hacks).
>
>>
>>> I still don't see how the alternative is supposed to simplify our life
>>> or improve the efficiency of the de-coalescing workaround. It's rather
>>> problematic like Gleb pointed out: The de-coalescing logic needs to be
>>> informed about periodicity changes that can only be delivered along
>>> IRQs. So what to do with the backlog when the timer is stopped?
>>
>> What happens with the current design? Gleb only mentioned the
>> frequency change, I thought that was not so big problem. But I don't
>> think this case should be allowed happen at all, it can't exist on
>> real HW.
>>
>>> Regarding an improved de-coalescing logic: As its final design is widely
>>> independent of how we collect the information, it could perfectly be
>>> done after we laid the elementary foundation.
>>
>> True. But what is the foundation, do we need bidirectional IRQs or
>> not? I still think current IRQs should be enough.
>
> Then explain how you want to do the association between IRQ injection
> and IRQ period. Both of my diagrams require an additional channel
> between source and sink, either forward (tag IRQ event with its period)
> or backward (report injection result - aka feedback IRQs). We currently
> don't have such channel but the APIC hack (apic_reset/get_irq_delivered).
>
> Jan
>
>
Jan Kiszka - May 29, 2010, 10:16 a.m.
Blue Swirl wrote:
>>>> This is - according to my current understanding - the proposed
>>>> alternative architecture:
>>>>
>>>>                                          .---------------.
>>>>                                          | de-coalescing |
>>>>                                          |     logic     |
>>>>                                          '---------------'
>>>>                                                ^   |
>>>>                                         period,|   |IRQ
>>>>                                       coalesced|   |(or tuned VM clock)
>>>>                                        (yes/no)|   v
>>>> .-------.              .--------.             .-------.
>>>> |  RTC  |-----IRQ----->| router |-----IRQ---->| APIC  |
>>>> '-------'              '--------'             '-------'
>>>>    |                    ^    |                   ^
>>>>    |                    |    |                   |
>>>>    '-------period-------'    '------period-------'
>>> The period information is already known by the higher level clock
>>> management,
>> The timer device models program the next event of some qemu-timer. There
>> is no tag attached explaining the timer subsystem or anyone else the
>> reason behind this programming.
> 
> Yes, but why would we care? All timers in the system besides RTC
> should be affected likewise, this would in fact be the benefit from
> handling the problem at higher level.

And how does your equation for calculating the clock slow-down look like?

Jan
Blue Swirl - May 29, 2010, 10:26 a.m.
On Sat, May 29, 2010 at 10:16 AM, Jan Kiszka <jan.kiszka@web.de> wrote:
> Blue Swirl wrote:
>>>>> This is - according to my current understanding - the proposed
>>>>> alternative architecture:
>>>>>
>>>>>                                          .---------------.
>>>>>                                          | de-coalescing |
>>>>>                                          |     logic     |
>>>>>                                          '---------------'
>>>>>                                                ^   |
>>>>>                                         period,|   |IRQ
>>>>>                                       coalesced|   |(or tuned VM clock)
>>>>>                                        (yes/no)|   v
>>>>> .-------.              .--------.             .-------.
>>>>> |  RTC  |-----IRQ----->| router |-----IRQ---->| APIC  |
>>>>> '-------'              '--------'             '-------'
>>>>>    |                    ^    |                   ^
>>>>>    |                    |    |                   |
>>>>>    '-------period-------'    '------period-------'
>>>> The period information is already known by the higher level clock
>>>> management,
>>> The timer device models program the next event of some qemu-timer. There
>>> is no tag attached explaining the timer subsystem or anyone else the
>>> reason behind this programming.
>>
>> Yes, but why would we care? All timers in the system besides RTC
>> should be affected likewise, this would in fact be the benefit from
>> handling the problem at higher level.
>
> And how does your equation for calculating the clock slow-down look like?

How about icount_adjust()?
Jan Kiszka - May 29, 2010, 10:38 a.m.
Blue Swirl wrote:
> On Sat, May 29, 2010 at 10:16 AM, Jan Kiszka <jan.kiszka@web.de> wrote:
>> Blue Swirl wrote:
>>>>>> This is - according to my current understanding - the proposed
>>>>>> alternative architecture:
>>>>>>
>>>>>>                                          .---------------.
>>>>>>                                          | de-coalescing |
>>>>>>                                          |     logic     |
>>>>>>                                          '---------------'
>>>>>>                                                ^   |
>>>>>>                                         period,|   |IRQ
>>>>>>                                       coalesced|   |(or tuned VM clock)
>>>>>>                                        (yes/no)|   v
>>>>>> .-------.              .--------.             .-------.
>>>>>> |  RTC  |-----IRQ----->| router |-----IRQ---->| APIC  |
>>>>>> '-------'              '--------'             '-------'
>>>>>>    |                    ^    |                   ^
>>>>>>    |                    |    |                   |
>>>>>>    '-------period-------'    '------period-------'
>>>>> The period information is already known by the higher level clock
>>>>> management,
>>>> The timer device models program the next event of some qemu-timer. There
>>>> is no tag attached explaining the timer subsystem or anyone else the
>>>> reason behind this programming.
>>> Yes, but why would we care? All timers in the system besides RTC
>>> should be affected likewise, this would in fact be the benefit from
>>> handling the problem at higher level.
>> And how does your equation for calculating the clock slow-down look like?
> 
> How about icount_adjust()?

I would suggest that you implement your ideas now. Please keep us
informed about the progress as this series (and more) depends on a decision.

Jan
Gleb Natapov - May 29, 2010, 2:38 p.m.
On Sat, May 29, 2010 at 09:15:11AM +0000, Blue Swirl wrote:
> >> There is no code, because we're still at architecture design stage.
> >>
> > Try to write test code to understand the problem better.
> 
> I will.
> 
Please do ASAP. This discussion shows that you don't understand what is the
problem that we are dialing with.

> >> >> >> guests could also be assisted with special handling (like win2k
> >> >> >> install hack), for example guest instructions could be counted
> >> >> >> (approximately, for example using TB size or TSC) and only inject
> >> >> >> after at least N instructions have passed.
> >> >> > Guest instructions cannot be easily counted in KVM (it can be done more
> >> >> > or less reliably using perf counters, may be).
> >> >>
> >> >> Aren't there any debug registers or perf counters, which can generate
> >> >> an interrupt after some number of instructions have been executed?
> >> > Don't think debug registers have something like that and they are
> >> > available for guest use anyway. Perf counters differs greatly from CPU
> >> > to CPU (even between two CPUs of the same manufacturer), and we want to
> >> > keep using them for profiling guests. And I don't see what problem it
> >> > will solve anyway that can be solved by simple delay between irq
> >> > reinjection.
> >>
> >> This would allow counting the executed instructions and limit it. Thus
> >> we could emulate a 500MHz CPU on a 2GHz CPU more accurately.
> >>
> > Why would you want to limit number of instruction executed by guest if
> > CPU has nothing else to do anyway? The problem occurs not when we have
> > spare cycles so give to a guest, but in opposite case.
> 
> I think one problem is that the guest has executed too much compared
> to what would happen with real HW with a lesser CPU. That explains the
> RTC frequency reprogramming case.
You think wrong. The problem is exactly opposite: the guest haven't
had enough execution time between two time interrupts. I don't know what
RTC frequency reprogramming case you are talking about here.

> 
> >
> >> >>
> >> >> >>
> >> >> >> > And even if the rate did not matter, the APIC woult still have to now
> >> >> >> > about the fact that an IRQ is really periodic and does not only appear
> >> >> >> > as such for a certain interval. This really does not sound like
> >> >> >> > simplifying things or even make them cleaner.
> >> >> >>
> >> >> >> It would, the voodoo would be contained only in APIC, RTC would be
> >> >> >> just like any other device. With the bidirectional irqs, this voodoo
> >> >> >> would probably eventually spread to many other devices. The logical
> >> >> >> conclusion of that would be a system where all devices would be
> >> >> >> careful not to disturb the guest at wrong moment because that would
> >> >> >> trigger a bug.
> >> >> >>
> >> >> > This voodoo will be so complex and unreliable that it will make RTC hack
> >> >> > pale in comparison (and I still don't see how you are going to make it
> >> >> > actually work).
> >> >>
> >> >> Implement everything inside APIC: only coalescing and reinjection.
> >> > APIC has zero info needed to implement reinjection correctly as was
> >> > shown to you several time in this thread and you simply keep ignoring
> >> > it.
> >>
> >> On the contrary, APIC is actually the only source of the IRQ ack
> >> information. RTC hack would not work without APIC (or the
> >> bidirectional IRQ) passing this info to RTC.
> >>
> >> What APIC doesn't have now is the timer frequency or period info. This
> >> is known by RTC and also higher levels managing the clocks.
> >>
> > So APIC has one bit of information and RTC everything else.
> 
> The information known by RTC (timer period) is also known by higher levels.
> 
What do you mean by higher level here? vl.c or APIC.

> > The current
> > approach (and proposed patch) brings this one bit of information to RTC,
> > you are arguing that RTC should be able to communicate all its info to
> > APIC. Sorry I don't see that your way has any advantage. Just more
> > complex interface and it is much easier to get it wrong for other time
> > sources.
> 
> I don't think anymore that APIC should be handling this but the
> generic stuff, like vl.c or exec.c. Then there would be only
> information passing from APIC to higher levels.
Handling reinjection by general timer code makes infinitely more sense
then handling it in APIC. One thing (from the top of my head) that can't
be implemented at that level is injection of interrupt back to back (i.e
injecting next interrupt immediately after guest acknowledge previous
one to RTC).

> 
> >> I keep ignoring the idea that the current model, where both RTC and
> >> APIC must somehow work together to make coalescing work, is the only
> >> possible just because it is committed and it happens to work in some
> >> cases. It would be much better to concentrate this to one place, APIC
> >> or preferably higher level where it may benefit other timers too.
> >> Provided of course that the other models can be made to work.
> >>
> > So write the code and show us. You haven't show any evidence that RTC is
> > the wrong place. RTC knows when interrupt was acknowledge to RTC, it
> > know when clock frequency changes, it know when device reset happened.
> > APIC knows only that interrupt was coalesced. It doesn't even know that
> > it may be masked by a guest in IOAPIC (interrupts delivered while they
> > are masked not considered coalesced).
> 
> Oh, I thought interrupt masking was the reason for coalescing! What
> exactly is the reason then?
> 
The reason is that guest has no time to process previous interrupt
before it is time to inject next one.

> > Time source knows only when
> > frequency changes and may be when device reset happens if timer is
> > stopped by device on reset. So RTC is actually a sweet spot if you want
> > to minimize amount of info you need to pass between various layers.
> >
> >> >> Maybe that version would not bend backwards as much as the current to
> >> >> cater for buggy hosts.
> >> >>
> >> > You mean "buggy guests"?
> >>
> >> Yes, sorry.
> >>
> >> > What guests are not buggy in your opinion?
> >> > Linux tries hard to be smart and as a result the only way to have stable
> >> > clock with it is to go paravirt.
> >>
> >> I'm not an OS designer, but I think an OS should never crash, even if
> >> a burst of IRQs is received. Reprogramming the timer should consider
> >> the pending IRQ situation (0 or 1 with real HW). Those bugs are one
> >> cause of the problem.
> > OS should never crash in the absence of HW bugs? I doubt you can design
> > an OS that can run in a face of any HW failure. Anyway here we are
> > trying to solve guests time keeping problem not crashes. Do you think
> > you can design OS that can keep time accurately no matter how crazy all
> > HW clock behaves?
> 
> I think my OS design skills are not relevant in this discussion, but
> IIRC there are fault tolerant operating systems for extreme conditions
> so it can be done.
> 
> >
> >>
> >> >> > The fact is that timer device is not "just like any
> >> >> > other device" in virtual world. Any other device is easy: you just
> >> >> > implement spec as close as possible and everything works. For time
> >> >> > source device this is not enough. You can implement RTC+HPET to the
> >> >> > letter and your guest will drift like crazy.
> >> >>
> >> >> It's doable: a cycle accurate emulator will not cause any drift,
> >> >> without any voodoo. The interrupts would come after executing the same
> >> >> instruction as the real HW. For emulating any sufficiently buggy
> >> >> guests in any sufficiently desperate low resource conditions, this may
> >> >> be the only option that will always work.
> >> >>
> >> > Yes, but qemu and kvm are not cycle accurate emulators and don't strive
> >> > to be one. On the contrary KVM runs at native host CPU speed most of the
> >> > time, so any emulation done between two instruction is theoretically
> >> > noticeable for a guest. TSC is bypassed directly to a guest too, so
> >> > keeping all time source in perfect sync is also impossible.
> >>
> >> That is actually another cause of the problem. KVM gives the guest an
> >> illusion that the VCPU speed is equal to host speed. When they don't
> >> match, especially in critical code, there can be problems. It would be
> >> better to tell the guest a lower speed, which also can be guaranteed.
> >>
> > Not possible. It's that simple. You should take it into account in your
> > architecture design stage. In case of KVM real physical CPU executes guest
> > instruction and it does this as fast as it can. The only way we can hide
> > that from a guest is by intercepting each access to TSC and at that
> > point we can use bochs instead.
> 
> Well, as Paul pointed out, there's also icount option.
> 
icount is not an option for KVM.

> >> Maybe we should also offline the device emulation to another host CPU
> >> with threading. A load from a device will always be much slower than
> >> on real HW though.
> > Time drift problem start to happen on loaded servers, so you do not have
> > spare CPU to offload device emulation too.
> >
> > --
> >                        Gleb.
> >

--
			Gleb.
Gleb Natapov - May 29, 2010, 2:46 p.m.
On Sat, May 29, 2010 at 09:35:30AM +0000, Blue Swirl wrote:
> > I still don't see how the alternative is supposed to simplify our life
> > or improve the efficiency of the de-coalescing workaround. It's rather
> > problematic like Gleb pointed out: The de-coalescing logic needs to be
> > informed about periodicity changes that can only be delivered along
> > IRQs. So what to do with the backlog when the timer is stopped?
> 
> What happens with the current design? Gleb only mentioned the
> frequency change, I thought that was not so big problem. But I don't
> think this case should be allowed happen at all, it can't exist on
> real HW.
> 
Hm, why it can't exist on real HW? Do simple exercise. Run WindowsXP
inside QEMU, connect with gdb to QEMU process and check what frequency
RTC configured with (hint: it will be 64Hz), now run video inside the
guest and check frequency again (hint: it will be 1Khz).

--
			Gleb.
Blue Swirl - May 29, 2010, 4:03 p.m.
2010/5/29 Gleb Natapov <gleb@redhat.com>:
> On Sat, May 29, 2010 at 09:15:11AM +0000, Blue Swirl wrote:
>> >> There is no code, because we're still at architecture design stage.
>> >>
>> > Try to write test code to understand the problem better.
>>
>> I will.
>>
> Please do ASAP. This discussion shows that you don't understand what is the
> problem that we are dialing with.

Which part of the problem you think I don't understand?

>> >> >> >> guests could also be assisted with special handling (like win2k
>> >> >> >> install hack), for example guest instructions could be counted
>> >> >> >> (approximately, for example using TB size or TSC) and only inject
>> >> >> >> after at least N instructions have passed.
>> >> >> > Guest instructions cannot be easily counted in KVM (it can be done more
>> >> >> > or less reliably using perf counters, may be).
>> >> >>
>> >> >> Aren't there any debug registers or perf counters, which can generate
>> >> >> an interrupt after some number of instructions have been executed?
>> >> > Don't think debug registers have something like that and they are
>> >> > available for guest use anyway. Perf counters differs greatly from CPU
>> >> > to CPU (even between two CPUs of the same manufacturer), and we want to
>> >> > keep using them for profiling guests. And I don't see what problem it
>> >> > will solve anyway that can be solved by simple delay between irq
>> >> > reinjection.
>> >>
>> >> This would allow counting the executed instructions and limit it. Thus
>> >> we could emulate a 500MHz CPU on a 2GHz CPU more accurately.
>> >>
>> > Why would you want to limit number of instruction executed by guest if
>> > CPU has nothing else to do anyway? The problem occurs not when we have
>> > spare cycles so give to a guest, but in opposite case.
>>
>> I think one problem is that the guest has executed too much compared
>> to what would happen with real HW with a lesser CPU. That explains the
>> RTC frequency reprogramming case.
> You think wrong. The problem is exactly opposite: the guest haven't
> had enough execution time between two time interrupts. I don't know what
> RTC frequency reprogramming case you are talking about here.

The case you told me where N pending tick IRQs exist but the guest
wants to change the RTC frequency from 64Hz to 1024Hz.

Let's make this more concrete. 1 GHz CPU, initially 100Hz RTC, so
10Mcycles/tick or 10ms/tick. At T = 30Mcycles, guest wants to change
the frequency to 1000Hz.

The problem for emulation is that for the same 3 ticks, there has been
so little execution power that the ticks have been coalesced. But
isn't the guest cycle count then much lower than 30Mcyc?

Isn't it so that the guest must be above 30Mcyc to be able to want the
change? But if we reach that point,  the problem must have not been
too little execution time, but too much.

>
>>
>> >
>> >> >>
>> >> >> >>
>> >> >> >> > And even if the rate did not matter, the APIC woult still have to now
>> >> >> >> > about the fact that an IRQ is really periodic and does not only appear
>> >> >> >> > as such for a certain interval. This really does not sound like
>> >> >> >> > simplifying things or even make them cleaner.
>> >> >> >>
>> >> >> >> It would, the voodoo would be contained only in APIC, RTC would be
>> >> >> >> just like any other device. With the bidirectional irqs, this voodoo
>> >> >> >> would probably eventually spread to many other devices. The logical
>> >> >> >> conclusion of that would be a system where all devices would be
>> >> >> >> careful not to disturb the guest at wrong moment because that would
>> >> >> >> trigger a bug.
>> >> >> >>
>> >> >> > This voodoo will be so complex and unreliable that it will make RTC hack
>> >> >> > pale in comparison (and I still don't see how you are going to make it
>> >> >> > actually work).
>> >> >>
>> >> >> Implement everything inside APIC: only coalescing and reinjection.
>> >> > APIC has zero info needed to implement reinjection correctly as was
>> >> > shown to you several time in this thread and you simply keep ignoring
>> >> > it.
>> >>
>> >> On the contrary, APIC is actually the only source of the IRQ ack
>> >> information. RTC hack would not work without APIC (or the
>> >> bidirectional IRQ) passing this info to RTC.
>> >>
>> >> What APIC doesn't have now is the timer frequency or period info. This
>> >> is known by RTC and also higher levels managing the clocks.
>> >>
>> > So APIC has one bit of information and RTC everything else.
>>
>> The information known by RTC (timer period) is also known by higher levels.
>>
> What do you mean by higher level here? vl.c or APIC.

vl.c, qemu-timer.c.

>> > The current
>> > approach (and proposed patch) brings this one bit of information to RTC,
>> > you are arguing that RTC should be able to communicate all its info to
>> > APIC. Sorry I don't see that your way has any advantage. Just more
>> > complex interface and it is much easier to get it wrong for other time
>> > sources.
>>
>> I don't think anymore that APIC should be handling this but the
>> generic stuff, like vl.c or exec.c. Then there would be only
>> information passing from APIC to higher levels.
> Handling reinjection by general timer code makes infinitely more sense
> then handling it in APIC.

I'm glad you agree, or did you mean 'less'?

> One thing (from the top of my head) that can't
> be implemented at that level is injection of interrupt back to back (i.e
> injecting next interrupt immediately after guest acknowledge previous
> one to RTC).

But Jan told this confuses some buggy OSes.

>
>>
>> >> I keep ignoring the idea that the current model, where both RTC and
>> >> APIC must somehow work together to make coalescing work, is the only
>> >> possible just because it is committed and it happens to work in some
>> >> cases. It would be much better to concentrate this to one place, APIC
>> >> or preferably higher level where it may benefit other timers too.
>> >> Provided of course that the other models can be made to work.
>> >>
>> > So write the code and show us. You haven't show any evidence that RTC is
>> > the wrong place. RTC knows when interrupt was acknowledge to RTC, it
>> > know when clock frequency changes, it know when device reset happened.
>> > APIC knows only that interrupt was coalesced. It doesn't even know that
>> > it may be masked by a guest in IOAPIC (interrupts delivered while they
>> > are masked not considered coalesced).
>>
>> Oh, I thought interrupt masking was the reason for coalescing! What
>> exactly is the reason then?
>>
> The reason is that guest has no time to process previous interrupt
> before it is time to inject next one.

Because of other host load or other emulation done by the same QEMU
process, I suppose?

>> > Time source knows only when
>> > frequency changes and may be when device reset happens if timer is
>> > stopped by device on reset. So RTC is actually a sweet spot if you want
>> > to minimize amount of info you need to pass between various layers.
>> >
>> >> >> Maybe that version would not bend backwards as much as the current to
>> >> >> cater for buggy hosts.
>> >> >>
>> >> > You mean "buggy guests"?
>> >>
>> >> Yes, sorry.
>> >>
>> >> > What guests are not buggy in your opinion?
>> >> > Linux tries hard to be smart and as a result the only way to have stable
>> >> > clock with it is to go paravirt.
>> >>
>> >> I'm not an OS designer, but I think an OS should never crash, even if
>> >> a burst of IRQs is received. Reprogramming the timer should consider
>> >> the pending IRQ situation (0 or 1 with real HW). Those bugs are one
>> >> cause of the problem.
>> > OS should never crash in the absence of HW bugs? I doubt you can design
>> > an OS that can run in a face of any HW failure. Anyway here we are
>> > trying to solve guests time keeping problem not crashes. Do you think
>> > you can design OS that can keep time accurately no matter how crazy all
>> > HW clock behaves?
>>
>> I think my OS design skills are not relevant in this discussion, but
>> IIRC there are fault tolerant operating systems for extreme conditions
>> so it can be done.
>>
>> >
>> >>
>> >> >> > The fact is that timer device is not "just like any
>> >> >> > other device" in virtual world. Any other device is easy: you just
>> >> >> > implement spec as close as possible and everything works. For time
>> >> >> > source device this is not enough. You can implement RTC+HPET to the
>> >> >> > letter and your guest will drift like crazy.
>> >> >>
>> >> >> It's doable: a cycle accurate emulator will not cause any drift,
>> >> >> without any voodoo. The interrupts would come after executing the same
>> >> >> instruction as the real HW. For emulating any sufficiently buggy
>> >> >> guests in any sufficiently desperate low resource conditions, this may
>> >> >> be the only option that will always work.
>> >> >>
>> >> > Yes, but qemu and kvm are not cycle accurate emulators and don't strive
>> >> > to be one. On the contrary KVM runs at native host CPU speed most of the
>> >> > time, so any emulation done between two instruction is theoretically
>> >> > noticeable for a guest. TSC is bypassed directly to a guest too, so
>> >> > keeping all time source in perfect sync is also impossible.
>> >>
>> >> That is actually another cause of the problem. KVM gives the guest an
>> >> illusion that the VCPU speed is equal to host speed. When they don't
>> >> match, especially in critical code, there can be problems. It would be
>> >> better to tell the guest a lower speed, which also can be guaranteed.
>> >>
>> > Not possible. It's that simple. You should take it into account in your
>> > architecture design stage. In case of KVM real physical CPU executes guest
>> > instruction and it does this as fast as it can. The only way we can hide
>> > that from a guest is by intercepting each access to TSC and at that
>> > point we can use bochs instead.
>>
>> Well, as Paul pointed out, there's also icount option.
>>
> icount is not an option for KVM.

I think icount timer adjustment model might make sense for this work
too. We'd then just need some figure of executed CPU instructions, TSC
cycles or even kernel scheduler time slice information (how much time
the process got).

>
>> >> Maybe we should also offline the device emulation to another host CPU
>> >> with threading. A load from a device will always be much slower than
>> >> on real HW though.
>> > Time drift problem start to happen on loaded servers, so you do not have
>> > spare CPU to offload device emulation too.
>> >
>> > --
>> >                        Gleb.
>> >
>
> --
>                        Gleb.
>
Blue Swirl - May 29, 2010, 4:13 p.m.
On Sat, May 29, 2010 at 2:46 PM, Gleb Natapov <gleb@redhat.com> wrote:
> On Sat, May 29, 2010 at 09:35:30AM +0000, Blue Swirl wrote:
>> > I still don't see how the alternative is supposed to simplify our life
>> > or improve the efficiency of the de-coalescing workaround. It's rather
>> > problematic like Gleb pointed out: The de-coalescing logic needs to be
>> > informed about periodicity changes that can only be delivered along
>> > IRQs. So what to do with the backlog when the timer is stopped?
>>
>> What happens with the current design? Gleb only mentioned the
>> frequency change, I thought that was not so big problem. But I don't
>> think this case should be allowed happen at all, it can't exist on
>> real HW.
>>
> Hm, why it can't exist on real HW? Do simple exercise. Run WindowsXP
> inside QEMU, connect with gdb to QEMU process and check what frequency
> RTC configured with (hint: it will be 64Hz), now run video inside the
> guest and check frequency again (hint: it will be 1Khz).

You missed the key word 'stopped'. If the timer is really stopped, no
IRQs should ever come out afterwards, just like on real HW. For the
emulation, this means loss of ticks which should have been delivered
before the change.

But what if the guest changed the frequency very often, and between
changes used zero value, like 64Hz -> 0Hz -> 128Hz -> 0Hz -> 64Hz?
Gleb Natapov - May 29, 2010, 4:32 p.m.
On Sat, May 29, 2010 at 04:03:22PM +0000, Blue Swirl wrote:
> 2010/5/29 Gleb Natapov <gleb@redhat.com>:
> > On Sat, May 29, 2010 at 09:15:11AM +0000, Blue Swirl wrote:
> >> >> There is no code, because we're still at architecture design stage.
> >> >>
> >> > Try to write test code to understand the problem better.
> >>
> >> I will.
> >>
> > Please do ASAP. This discussion shows that you don't understand what is the
> > problem that we are dialing with.
> 
> Which part of the problem you think I don't understand?
> 
It seams to me you don't understand how Windows uses RTC for time
keeping and how the QEMU solves the problem today.

> >> >> >> >> guests could also be assisted with special handling (like win2k
> >> >> >> >> install hack), for example guest instructions could be counted
> >> >> >> >> (approximately, for example using TB size or TSC) and only inject
> >> >> >> >> after at least N instructions have passed.
> >> >> >> > Guest instructions cannot be easily counted in KVM (it can be done more
> >> >> >> > or less reliably using perf counters, may be).
> >> >> >>
> >> >> >> Aren't there any debug registers or perf counters, which can generate
> >> >> >> an interrupt after some number of instructions have been executed?
> >> >> > Don't think debug registers have something like that and they are
> >> >> > available for guest use anyway. Perf counters differs greatly from CPU
> >> >> > to CPU (even between two CPUs of the same manufacturer), and we want to
> >> >> > keep using them for profiling guests. And I don't see what problem it
> >> >> > will solve anyway that can be solved by simple delay between irq
> >> >> > reinjection.
> >> >>
> >> >> This would allow counting the executed instructions and limit it. Thus
> >> >> we could emulate a 500MHz CPU on a 2GHz CPU more accurately.
> >> >>
> >> > Why would you want to limit number of instruction executed by guest if
> >> > CPU has nothing else to do anyway? The problem occurs not when we have
> >> > spare cycles so give to a guest, but in opposite case.
> >>
> >> I think one problem is that the guest has executed too much compared
> >> to what would happen with real HW with a lesser CPU. That explains the
> >> RTC frequency reprogramming case.
> > You think wrong. The problem is exactly opposite: the guest haven't
> > had enough execution time between two time interrupts. I don't know what
> > RTC frequency reprogramming case you are talking about here.
> 
> The case you told me where N pending tick IRQs exist but the guest
> wants to change the RTC frequency from 64Hz to 1024Hz.
> 
> Let's make this more concrete. 1 GHz CPU, initially 100Hz RTC, so
> 10Mcycles/tick or 10ms/tick. At T = 30Mcycles, guest wants to change
> the frequency to 1000Hz.
> 
> The problem for emulation is that for the same 3 ticks, there has been
> so little execution power that the ticks have been coalesced. But
> isn't the guest cycle count then much lower than 30Mcyc?
> 
> Isn't it so that the guest must be above 30Mcyc to be able to want the
> change? But if we reach that point,  the problem must have not been
> too little execution time, but too much.
> 
Sorry I tried hard to understand what have you said above but failed.
What do you mean "to be able to want the change"? Guest sometimes wants
to get 64 timer interrupts per second and sometimes it wants to get 1024
timer interrupt per second. It wants it not as a result of time drift or
anything. It's just how guest behaves. You seams to be to fixated on
guest frequency change. It's just something you have to take into
account when you reinject interrupts.


> >
> >>
> >> >
> >> >> >>
> >> >> >> >>
> >> >> >> >> > And even if the rate did not matter, the APIC woult still have to now
> >> >> >> >> > about the fact that an IRQ is really periodic and does not only appear
> >> >> >> >> > as such for a certain interval. This really does not sound like
> >> >> >> >> > simplifying things or even make them cleaner.
> >> >> >> >>
> >> >> >> >> It would, the voodoo would be contained only in APIC, RTC would be
> >> >> >> >> just like any other device. With the bidirectional irqs, this voodoo
> >> >> >> >> would probably eventually spread to many other devices. The logical
> >> >> >> >> conclusion of that would be a system where all devices would be
> >> >> >> >> careful not to disturb the guest at wrong moment because that would
> >> >> >> >> trigger a bug.
> >> >> >> >>
> >> >> >> > This voodoo will be so complex and unreliable that it will make RTC hack
> >> >> >> > pale in comparison (and I still don't see how you are going to make it
> >> >> >> > actually work).
> >> >> >>
> >> >> >> Implement everything inside APIC: only coalescing and reinjection.
> >> >> > APIC has zero info needed to implement reinjection correctly as was
> >> >> > shown to you several time in this thread and you simply keep ignoring
> >> >> > it.
> >> >>
> >> >> On the contrary, APIC is actually the only source of the IRQ ack
> >> >> information. RTC hack would not work without APIC (or the
> >> >> bidirectional IRQ) passing this info to RTC.
> >> >>
> >> >> What APIC doesn't have now is the timer frequency or period info. This
> >> >> is known by RTC and also higher levels managing the clocks.
> >> >>
> >> > So APIC has one bit of information and RTC everything else.
> >>
> >> The information known by RTC (timer period) is also known by higher levels.
> >>
> > What do you mean by higher level here? vl.c or APIC.
> 
> vl.c, qemu-timer.c.
> 
> >> > The current
> >> > approach (and proposed patch) brings this one bit of information to RTC,
> >> > you are arguing that RTC should be able to communicate all its info to
> >> > APIC. Sorry I don't see that your way has any advantage. Just more
> >> > complex interface and it is much easier to get it wrong for other time
> >> > sources.
> >>
> >> I don't think anymore that APIC should be handling this but the
> >> generic stuff, like vl.c or exec.c. Then there would be only
> >> information passing from APIC to higher levels.
> > Handling reinjection by general timer code makes infinitely more sense
> > then handling it in APIC.
> 
> I'm glad you agree, or did you mean 'less'?
> 
Compared to APIC I would agree that even putting it in IDE is better idea :)

> > One thing (from the top of my head) that can't
> > be implemented at that level is injection of interrupt back to back (i.e
> > injecting next interrupt immediately after guest acknowledge previous
> > one to RTC).
> 
> But Jan told this confuses some buggy OSes.
> 
You keep calling them buggy, but I don't agree. They are written with
certain assumption that are true on real HW, but hard to achieve on
virtual. Anyway we use this technique (back to back reinjection)
otherwise you can't solve drift problem if guest want to receive
timer interrupts with max frequency that host time source support.


> >
> >>
> >> >> I keep ignoring the idea that the current model, where both RTC and
> >> >> APIC must somehow work together to make coalescing work, is the only
> >> >> possible just because it is committed and it happens to work in some
> >> >> cases. It would be much better to concentrate this to one place, APIC
> >> >> or preferably higher level where it may benefit other timers too.
> >> >> Provided of course that the other models can be made to work.
> >> >>
> >> > So write the code and show us. You haven't show any evidence that RTC is
> >> > the wrong place. RTC knows when interrupt was acknowledge to RTC, it
> >> > know when clock frequency changes, it know when device reset happened.
> >> > APIC knows only that interrupt was coalesced. It doesn't even know that
> >> > it may be masked by a guest in IOAPIC (interrupts delivered while they
> >> > are masked not considered coalesced).
> >>
> >> Oh, I thought interrupt masking was the reason for coalescing! What
> >> exactly is the reason then?
> >>
> > The reason is that guest has no time to process previous interrupt
> > before it is time to inject next one.
> 
> Because of other host load or other emulation done by the same QEMU
> process, I suppose?
Yes, both.

> 
> >> > Time source knows only when
> >> > frequency changes and may be when device reset happens if timer is
> >> > stopped by device on reset. So RTC is actually a sweet spot if you want
> >> > to minimize amount of info you need to pass between various layers.
> >> >
> >> >> >> Maybe that version would not bend backwards as much as the current to
> >> >> >> cater for buggy hosts.
> >> >> >>
> >> >> > You mean "buggy guests"?
> >> >>
> >> >> Yes, sorry.
> >> >>
> >> >> > What guests are not buggy in your opinion?
> >> >> > Linux tries hard to be smart and as a result the only way to have stable
> >> >> > clock with it is to go paravirt.
> >> >>
> >> >> I'm not an OS designer, but I think an OS should never crash, even if
> >> >> a burst of IRQs is received. Reprogramming the timer should consider
> >> >> the pending IRQ situation (0 or 1 with real HW). Those bugs are one
> >> >> cause of the problem.
> >> > OS should never crash in the absence of HW bugs? I doubt you can design
> >> > an OS that can run in a face of any HW failure. Anyway here we are
> >> > trying to solve guests time keeping problem not crashes. Do you think
> >> > you can design OS that can keep time accurately no matter how crazy all
> >> > HW clock behaves?
> >>
> >> I think my OS design skills are not relevant in this discussion, but
> >> IIRC there are fault tolerant operating systems for extreme conditions
> >> so it can be done.
> >>
> >> >
> >> >>
> >> >> >> > The fact is that timer device is not "just like any
> >> >> >> > other device" in virtual world. Any other device is easy: you just
> >> >> >> > implement spec as close as possible and everything works. For time
> >> >> >> > source device this is not enough. You can implement RTC+HPET to the
> >> >> >> > letter and your guest will drift like crazy.
> >> >> >>
> >> >> >> It's doable: a cycle accurate emulator will not cause any drift,
> >> >> >> without any voodoo. The interrupts would come after executing the same
> >> >> >> instruction as the real HW. For emulating any sufficiently buggy
> >> >> >> guests in any sufficiently desperate low resource conditions, this may
> >> >> >> be the only option that will always work.
> >> >> >>
> >> >> > Yes, but qemu and kvm are not cycle accurate emulators and don't strive
> >> >> > to be one. On the contrary KVM runs at native host CPU speed most of the
> >> >> > time, so any emulation done between two instruction is theoretically
> >> >> > noticeable for a guest. TSC is bypassed directly to a guest too, so
> >> >> > keeping all time source in perfect sync is also impossible.
> >> >>
> >> >> That is actually another cause of the problem. KVM gives the guest an
> >> >> illusion that the VCPU speed is equal to host speed. When they don't
> >> >> match, especially in critical code, there can be problems. It would be
> >> >> better to tell the guest a lower speed, which also can be guaranteed.
> >> >>
> >> > Not possible. It's that simple. You should take it into account in your
> >> > architecture design stage. In case of KVM real physical CPU executes guest
> >> > instruction and it does this as fast as it can. The only way we can hide
> >> > that from a guest is by intercepting each access to TSC and at that
> >> > point we can use bochs instead.
> >>
> >> Well, as Paul pointed out, there's also icount option.
> >>
> > icount is not an option for KVM.
> 
> I think icount timer adjustment model might make sense for this work
> too. We'd then just need some figure of executed CPU instructions, TSC
> cycles or even kernel scheduler time slice information (how much time
> the process got).
>
And then? icount makes guest time flow dependant on amount of emulated
instructions. It relies on the fact that all time sources are
synchronized for a guest during emulation (including TSC). This is not
true for virtualization.
 
--
			Gleb.
Gleb Natapov - May 29, 2010, 4:37 p.m.
On Sat, May 29, 2010 at 04:13:22PM +0000, Blue Swirl wrote:
> On Sat, May 29, 2010 at 2:46 PM, Gleb Natapov <gleb@redhat.com> wrote:
> > On Sat, May 29, 2010 at 09:35:30AM +0000, Blue Swirl wrote:
> >> > I still don't see how the alternative is supposed to simplify our life
> >> > or improve the efficiency of the de-coalescing workaround. It's rather
> >> > problematic like Gleb pointed out: The de-coalescing logic needs to be
> >> > informed about periodicity changes that can only be delivered along
> >> > IRQs. So what to do with the backlog when the timer is stopped?
> >>
> >> What happens with the current design? Gleb only mentioned the
> >> frequency change, I thought that was not so big problem. But I don't
> >> think this case should be allowed happen at all, it can't exist on
> >> real HW.
> >>
> > Hm, why it can't exist on real HW? Do simple exercise. Run WindowsXP
> > inside QEMU, connect with gdb to QEMU process and check what frequency
> > RTC configured with (hint: it will be 64Hz), now run video inside the
> > guest and check frequency again (hint: it will be 1Khz).
> 
> You missed the key word 'stopped'. If the timer is really stopped, no
> IRQs should ever come out afterwards, just like on real HW. For the
> emulation, this means loss of ticks which should have been delivered
> before the change.
> 
I haven't missed it. I describe to you reality of the situation. You want
to change reality to be more close to what you want it to be by adding
words to my description. Please just go write code, experiment, debug
and _then_ come here with design.

> But what if the guest changed the frequency very often, and between
> changes used zero value, like 64Hz -> 0Hz -> 128Hz -> 0Hz -> 64Hz?
Too bad, the world is not perfect.

--
			Gleb.
Blue Swirl - May 29, 2010, 8:52 p.m.
On Sat, May 29, 2010 at 4:32 PM, Gleb Natapov <gleb@redhat.com> wrote:
> On Sat, May 29, 2010 at 04:03:22PM +0000, Blue Swirl wrote:
>> 2010/5/29 Gleb Natapov <gleb@redhat.com>:
>> > On Sat, May 29, 2010 at 09:15:11AM +0000, Blue Swirl wrote:
>> >> >> There is no code, because we're still at architecture design stage.
>> >> >>
>> >> > Try to write test code to understand the problem better.
>> >>
>> >> I will.
>> >>
>> > Please do ASAP. This discussion shows that you don't understand what is the
>> > problem that we are dialing with.
>>
>> Which part of the problem you think I don't understand?
>>
> It seams to me you don't understand how Windows uses RTC for time
> keeping and how the QEMU solves the problem today.

RTC causes periodic interrupts and Windows interrupt handler
increments jiffies, like Linux?

>> >> >> >> >> guests could also be assisted with special handling (like win2k
>> >> >> >> >> install hack), for example guest instructions could be counted
>> >> >> >> >> (approximately, for example using TB size or TSC) and only inject
>> >> >> >> >> after at least N instructions have passed.
>> >> >> >> > Guest instructions cannot be easily counted in KVM (it can be done more
>> >> >> >> > or less reliably using perf counters, may be).
>> >> >> >>
>> >> >> >> Aren't there any debug registers or perf counters, which can generate
>> >> >> >> an interrupt after some number of instructions have been executed?
>> >> >> > Don't think debug registers have something like that and they are
>> >> >> > available for guest use anyway. Perf counters differs greatly from CPU
>> >> >> > to CPU (even between two CPUs of the same manufacturer), and we want to
>> >> >> > keep using them for profiling guests. And I don't see what problem it
>> >> >> > will solve anyway that can be solved by simple delay between irq
>> >> >> > reinjection.
>> >> >>
>> >> >> This would allow counting the executed instructions and limit it. Thus
>> >> >> we could emulate a 500MHz CPU on a 2GHz CPU more accurately.
>> >> >>
>> >> > Why would you want to limit number of instruction executed by guest if
>> >> > CPU has nothing else to do anyway? The problem occurs not when we have
>> >> > spare cycles so give to a guest, but in opposite case.
>> >>
>> >> I think one problem is that the guest has executed too much compared
>> >> to what would happen with real HW with a lesser CPU. That explains the
>> >> RTC frequency reprogramming case.
>> > You think wrong. The problem is exactly opposite: the guest haven't
>> > had enough execution time between two time interrupts. I don't know what
>> > RTC frequency reprogramming case you are talking about here.
>>
>> The case you told me where N pending tick IRQs exist but the guest
>> wants to change the RTC frequency from 64Hz to 1024Hz.
>>
>> Let's make this more concrete. 1 GHz CPU, initially 100Hz RTC, so
>> 10Mcycles/tick or 10ms/tick. At T = 30Mcycles, guest wants to change
>> the frequency to 1000Hz.
>>
>> The problem for emulation is that for the same 3 ticks, there has been
>> so little execution power that the ticks have been coalesced. But
>> isn't the guest cycle count then much lower than 30Mcyc?
>>
>> Isn't it so that the guest must be above 30Mcyc to be able to want the
>> change? But if we reach that point,  the problem must have not been
>> too little execution time, but too much.
>>
> Sorry I tried hard to understand what have you said above but failed.
> What do you mean "to be able to want the change"? Guest sometimes wants
> to get 64 timer interrupts per second and sometimes it wants to get 1024
> timer interrupt per second. It wants it not as a result of time drift or
> anything. It's just how guest behaves. You seams to be to fixated on
> guest frequency change. It's just something you have to take into
> account when you reinject interrupts.

I meant that in the scenario, the guest won't change the RTC before
30Mcyc because of some built in determinism in the guest. At that
point, because of some reason, the change would happen.

>
>> >
>> >>
>> >> >
>> >> >> >>
>> >> >> >> >>
>> >> >> >> >> > And even if the rate did not matter, the APIC woult still have to now
>> >> >> >> >> > about the fact that an IRQ is really periodic and does not only appear
>> >> >> >> >> > as such for a certain interval. This really does not sound like
>> >> >> >> >> > simplifying things or even make them cleaner.
>> >> >> >> >>
>> >> >> >> >> It would, the voodoo would be contained only in APIC, RTC would be
>> >> >> >> >> just like any other device. With the bidirectional irqs, this voodoo
>> >> >> >> >> would probably eventually spread to many other devices. The logical
>> >> >> >> >> conclusion of that would be a system where all devices would be
>> >> >> >> >> careful not to disturb the guest at wrong moment because that would
>> >> >> >> >> trigger a bug.
>> >> >> >> >>
>> >> >> >> > This voodoo will be so complex and unreliable that it will make RTC hack
>> >> >> >> > pale in comparison (and I still don't see how you are going to make it
>> >> >> >> > actually work).
>> >> >> >>
>> >> >> >> Implement everything inside APIC: only coalescing and reinjection.
>> >> >> > APIC has zero info needed to implement reinjection correctly as was
>> >> >> > shown to you several time in this thread and you simply keep ignoring
>> >> >> > it.
>> >> >>
>> >> >> On the contrary, APIC is actually the only source of the IRQ ack
>> >> >> information. RTC hack would not work without APIC (or the
>> >> >> bidirectional IRQ) passing this info to RTC.
>> >> >>
>> >> >> What APIC doesn't have now is the timer frequency or period info. This
>> >> >> is known by RTC and also higher levels managing the clocks.
>> >> >>
>> >> > So APIC has one bit of information and RTC everything else.
>> >>
>> >> The information known by RTC (timer period) is also known by higher levels.
>> >>
>> > What do you mean by higher level here? vl.c or APIC.
>>
>> vl.c, qemu-timer.c.
>>
>> >> > The current
>> >> > approach (and proposed patch) brings this one bit of information to RTC,
>> >> > you are arguing that RTC should be able to communicate all its info to
>> >> > APIC. Sorry I don't see that your way has any advantage. Just more
>> >> > complex interface and it is much easier to get it wrong for other time
>> >> > sources.
>> >>
>> >> I don't think anymore that APIC should be handling this but the
>> >> generic stuff, like vl.c or exec.c. Then there would be only
>> >> information passing from APIC to higher levels.
>> > Handling reinjection by general timer code makes infinitely more sense
>> > then handling it in APIC.
>>
>> I'm glad you agree, or did you mean 'less'?
>>
> Compared to APIC I would agree that even putting it in IDE is better idea :)
>
>> > One thing (from the top of my head) that can't
>> > be implemented at that level is injection of interrupt back to back (i.e
>> > injecting next interrupt immediately after guest acknowledge previous
>> > one to RTC).
>>
>> But Jan told this confuses some buggy OSes.
>>
> You keep calling them buggy, but I don't agree. They are written with
> certain assumption that are true on real HW, but hard to achieve on
> virtual. Anyway we use this technique (back to back reinjection)
> otherwise you can't solve drift problem if guest want to receive
> timer interrupts with max frequency that host time source support.

Even if this confuses some OSes?

>> >
>> >>
>> >> >> I keep ignoring the idea that the current model, where both RTC and
>> >> >> APIC must somehow work together to make coalescing work, is the only
>> >> >> possible just because it is committed and it happens to work in some
>> >> >> cases. It would be much better to concentrate this to one place, APIC
>> >> >> or preferably higher level where it may benefit other timers too.
>> >> >> Provided of course that the other models can be made to work.
>> >> >>
>> >> > So write the code and show us. You haven't show any evidence that RTC is
>> >> > the wrong place. RTC knows when interrupt was acknowledge to RTC, it
>> >> > know when clock frequency changes, it know when device reset happened.
>> >> > APIC knows only that interrupt was coalesced. It doesn't even know that
>> >> > it may be masked by a guest in IOAPIC (interrupts delivered while they
>> >> > are masked not considered coalesced).
>> >>
>> >> Oh, I thought interrupt masking was the reason for coalescing! What
>> >> exactly is the reason then?
>> >>
>> > The reason is that guest has no time to process previous interrupt
>> > before it is time to inject next one.
>>
>> Because of other host load or other emulation done by the same QEMU
>> process, I suppose?
> Yes, both.
>
>>
>> >> > Time source knows only when
>> >> > frequency changes and may be when device reset happens if timer is
>> >> > stopped by device on reset. So RTC is actually a sweet spot if you want
>> >> > to minimize amount of info you need to pass between various layers.
>> >> >
>> >> >> >> Maybe that version would not bend backwards as much as the current to
>> >> >> >> cater for buggy hosts.
>> >> >> >>
>> >> >> > You mean "buggy guests"?
>> >> >>
>> >> >> Yes, sorry.
>> >> >>
>> >> >> > What guests are not buggy in your opinion?
>> >> >> > Linux tries hard to be smart and as a result the only way to have stable
>> >> >> > clock with it is to go paravirt.
>> >> >>
>> >> >> I'm not an OS designer, but I think an OS should never crash, even if
>> >> >> a burst of IRQs is received. Reprogramming the timer should consider
>> >> >> the pending IRQ situation (0 or 1 with real HW). Those bugs are one
>> >> >> cause of the problem.
>> >> > OS should never crash in the absence of HW bugs? I doubt you can design
>> >> > an OS that can run in a face of any HW failure. Anyway here we are
>> >> > trying to solve guests time keeping problem not crashes. Do you think
>> >> > you can design OS that can keep time accurately no matter how crazy all
>> >> > HW clock behaves?
>> >>
>> >> I think my OS design skills are not relevant in this discussion, but
>> >> IIRC there are fault tolerant operating systems for extreme conditions
>> >> so it can be done.
>> >>
>> >> >
>> >> >>
>> >> >> >> > The fact is that timer device is not "just like any
>> >> >> >> > other device" in virtual world. Any other device is easy: you just
>> >> >> >> > implement spec as close as possible and everything works. For time
>> >> >> >> > source device this is not enough. You can implement RTC+HPET to the
>> >> >> >> > letter and your guest will drift like crazy.
>> >> >> >>
>> >> >> >> It's doable: a cycle accurate emulator will not cause any drift,
>> >> >> >> without any voodoo. The interrupts would come after executing the same
>> >> >> >> instruction as the real HW. For emulating any sufficiently buggy
>> >> >> >> guests in any sufficiently desperate low resource conditions, this may
>> >> >> >> be the only option that will always work.
>> >> >> >>
>> >> >> > Yes, but qemu and kvm are not cycle accurate emulators and don't strive
>> >> >> > to be one. On the contrary KVM runs at native host CPU speed most of the
>> >> >> > time, so any emulation done between two instruction is theoretically
>> >> >> > noticeable for a guest. TSC is bypassed directly to a guest too, so
>> >> >> > keeping all time source in perfect sync is also impossible.
>> >> >>
>> >> >> That is actually another cause of the problem. KVM gives the guest an
>> >> >> illusion that the VCPU speed is equal to host speed. When they don't
>> >> >> match, especially in critical code, there can be problems. It would be
>> >> >> better to tell the guest a lower speed, which also can be guaranteed.
>> >> >>
>> >> > Not possible. It's that simple. You should take it into account in your
>> >> > architecture design stage. In case of KVM real physical CPU executes guest
>> >> > instruction and it does this as fast as it can. The only way we can hide
>> >> > that from a guest is by intercepting each access to TSC and at that
>> >> > point we can use bochs instead.
>> >>
>> >> Well, as Paul pointed out, there's also icount option.
>> >>
>> > icount is not an option for KVM.
>>
>> I think icount timer adjustment model might make sense for this work
>> too. We'd then just need some figure of executed CPU instructions, TSC
>> cycles or even kernel scheduler time slice information (how much time
>> the process got).
>>
> And then? icount makes guest time flow dependant on amount of emulated
> instructions. It relies on the fact that all time sources are
> synchronized for a guest during emulation (including TSC). This is not
> true for virtualization.

So for virtualization, is it OK then to keep time sources unsynchronized?
Blue Swirl - May 29, 2010, 9:21 p.m.
On Sat, May 29, 2010 at 4:37 PM, Gleb Natapov <gleb@redhat.com> wrote:
> On Sat, May 29, 2010 at 04:13:22PM +0000, Blue Swirl wrote:
>> On Sat, May 29, 2010 at 2:46 PM, Gleb Natapov <gleb@redhat.com> wrote:
>> > On Sat, May 29, 2010 at 09:35:30AM +0000, Blue Swirl wrote:
>> >> > I still don't see how the alternative is supposed to simplify our life
>> >> > or improve the efficiency of the de-coalescing workaround. It's rather
>> >> > problematic like Gleb pointed out: The de-coalescing logic needs to be
>> >> > informed about periodicity changes that can only be delivered along
>> >> > IRQs. So what to do with the backlog when the timer is stopped?
>> >>
>> >> What happens with the current design? Gleb only mentioned the
>> >> frequency change, I thought that was not so big problem. But I don't
>> >> think this case should be allowed happen at all, it can't exist on
>> >> real HW.
>> >>
>> > Hm, why it can't exist on real HW? Do simple exercise. Run WindowsXP
>> > inside QEMU, connect with gdb to QEMU process and check what frequency
>> > RTC configured with (hint: it will be 64Hz), now run video inside the
>> > guest and check frequency again (hint: it will be 1Khz).
>>
>> You missed the key word 'stopped'. If the timer is really stopped, no
>> IRQs should ever come out afterwards, just like on real HW. For the
>> emulation, this means loss of ticks which should have been delivered
>> before the change.
>>
> I haven't missed it. I describe to you reality of the situation. You want
> to change reality to be more close to what you want it to be by adding
> words to my description.

Quoting Jan: 'So what to do with the backlog when the timer is
stopped?' I didn't add any words to your description, please be more
careful with your attributions. Why do you think I want to change the
reality?

XP frequency change isn't the same case as timer being stopped.

> Please just go write code, experiment, debug
> and _then_ come here with design.

I added some debugging to RTC, PIC and APIC. I also built a small
guest in x86 assembly to test the coalescing. However, in the tests
with this guest and others I noticed that the coalescing only happens
in some obscure conditions.

By default the APIC's delivery method for IRQs is ExtInt and
coalescing counting happens only with Fixed. This means that the guest
needs to reprogram APIC. It also looks like RTC interrupts need to be
triggered. But I didn't see both of these to happen simultaneously in
my tests with Linux and Windows guests. Of course, -rtc-td-hack flag
must be used and I also disabled HPET to be sure that RTC would be
used.

With DEBUG_COALESCING enabled, I just get increasing numbers for
apic_irq_delivered:
apic: apic_set_irq: coalescing 67123
apic: apic_set_irq: coalescing 67124
apic: apic_set_irq: coalescing 67125

If the hack were active, the numbers would be close to zero (or at
least some point) because apic_reset_irq_delivered would be called,
but this does not happen. Could you specify a clear test case with
which the coalescing action could be tested? Linux or BSD based,
please.

>> But what if the guest changed the frequency very often, and between
>> changes used zero value, like 64Hz -> 0Hz -> 128Hz -> 0Hz -> 64Hz?
> Too bad, the world is not perfect.
>
> --
>                        Gleb.
>
Gleb Natapov - May 30, 2010, 5:41 a.m.
On Sat, May 29, 2010 at 08:52:34PM +0000, Blue Swirl wrote:
> On Sat, May 29, 2010 at 4:32 PM, Gleb Natapov <gleb@redhat.com> wrote:
> > On Sat, May 29, 2010 at 04:03:22PM +0000, Blue Swirl wrote:
> >> 2010/5/29 Gleb Natapov <gleb@redhat.com>:
> >> > On Sat, May 29, 2010 at 09:15:11AM +0000, Blue Swirl wrote:
> >> >> >> There is no code, because we're still at architecture design stage.
> >> >> >>
> >> >> > Try to write test code to understand the problem better.
> >> >>
> >> >> I will.
> >> >>
> >> > Please do ASAP. This discussion shows that you don't understand what is the
> >> > problem that we are dialing with.
> >>
> >> Which part of the problem you think I don't understand?
> >>
> > It seams to me you don't understand how Windows uses RTC for time
> > keeping and how the QEMU solves the problem today.
> 
> RTC causes periodic interrupts and Windows interrupt handler
> increments jiffies, like Linux?
> 
Linux does much more complicated things than that to keep time, so the
only way to fix time drift in Linux was to introduce pvclock. For Window
it is not so accurate too, since Windows can change clock frequency any
time it can't calculate time from jiffies, it needs to update clock at
each time tick.

> >> >> >> >> >> guests could also be assisted with special handling (like win2k
> >> >> >> >> >> install hack), for example guest instructions could be counted
> >> >> >> >> >> (approximately, for example using TB size or TSC) and only inject
> >> >> >> >> >> after at least N instructions have passed.
> >> >> >> >> > Guest instructions cannot be easily counted in KVM (it can be done more
> >> >> >> >> > or less reliably using perf counters, may be).
> >> >> >> >>
> >> >> >> >> Aren't there any debug registers or perf counters, which can generate
> >> >> >> >> an interrupt after some number of instructions have been executed?
> >> >> >> > Don't think debug registers have something like that and they are
> >> >> >> > available for guest use anyway. Perf counters differs greatly from CPU
> >> >> >> > to CPU (even between two CPUs of the same manufacturer), and we want to
> >> >> >> > keep using them for profiling guests. And I don't see what problem it
> >> >> >> > will solve anyway that can be solved by simple delay between irq
> >> >> >> > reinjection.
> >> >> >>
> >> >> >> This would allow counting the executed instructions and limit it. Thus
> >> >> >> we could emulate a 500MHz CPU on a 2GHz CPU more accurately.
> >> >> >>
> >> >> > Why would you want to limit number of instruction executed by guest if
> >> >> > CPU has nothing else to do anyway? The problem occurs not when we have
> >> >> > spare cycles so give to a guest, but in opposite case.
> >> >>
> >> >> I think one problem is that the guest has executed too much compared
> >> >> to what would happen with real HW with a lesser CPU. That explains the
> >> >> RTC frequency reprogramming case.
> >> > You think wrong. The problem is exactly opposite: the guest haven't
> >> > had enough execution time between two time interrupts. I don't know what
> >> > RTC frequency reprogramming case you are talking about here.
> >>
> >> The case you told me where N pending tick IRQs exist but the guest
> >> wants to change the RTC frequency from 64Hz to 1024Hz.
> >>
> >> Let's make this more concrete. 1 GHz CPU, initially 100Hz RTC, so
> >> 10Mcycles/tick or 10ms/tick. At T = 30Mcycles, guest wants to change
> >> the frequency to 1000Hz.
> >>
> >> The problem for emulation is that for the same 3 ticks, there has been
> >> so little execution power that the ticks have been coalesced. But
> >> isn't the guest cycle count then much lower than 30Mcyc?
> >>
> >> Isn't it so that the guest must be above 30Mcyc to be able to want the
> >> change? But if we reach that point,  the problem must have not been
> >> too little execution time, but too much.
> >>
> > Sorry I tried hard to understand what have you said above but failed.
> > What do you mean "to be able to want the change"? Guest sometimes wants
> > to get 64 timer interrupts per second and sometimes it wants to get 1024
> > timer interrupt per second. It wants it not as a result of time drift or
> > anything. It's just how guest behaves. You seams to be to fixated on
> > guest frequency change. It's just something you have to take into
> > account when you reinject interrupts.
> 
> I meant that in the scenario, the guest won't change the RTC before
> 30Mcyc because of some built in determinism in the guest. At that
> point, because of some reason, the change would happen.
> 
I still don't understand what are you trying to say here. Guest changes
frequency because of some even in the guest. It is totally independent
of what happens in QEMUs RTC emulation.

> >
> >> >
> >> >>
> >> >> >
> >> >> >> >>
> >> >> >> >> >>
> >> >> >> >> >> > And even if the rate did not matter, the APIC woult still have to now
> >> >> >> >> >> > about the fact that an IRQ is really periodic and does not only appear
> >> >> >> >> >> > as such for a certain interval. This really does not sound like
> >> >> >> >> >> > simplifying things or even make them cleaner.
> >> >> >> >> >>
> >> >> >> >> >> It would, the voodoo would be contained only in APIC, RTC would be
> >> >> >> >> >> just like any other device. With the bidirectional irqs, this voodoo
> >> >> >> >> >> would probably eventually spread to many other devices. The logical
> >> >> >> >> >> conclusion of that would be a system where all devices would be
> >> >> >> >> >> careful not to disturb the guest at wrong moment because that would
> >> >> >> >> >> trigger a bug.
> >> >> >> >> >>
> >> >> >> >> > This voodoo will be so complex and unreliable that it will make RTC hack
> >> >> >> >> > pale in comparison (and I still don't see how you are going to make it
> >> >> >> >> > actually work).
> >> >> >> >>
> >> >> >> >> Implement everything inside APIC: only coalescing and reinjection.
> >> >> >> > APIC has zero info needed to implement reinjection correctly as was
> >> >> >> > shown to you several time in this thread and you simply keep ignoring
> >> >> >> > it.
> >> >> >>
> >> >> >> On the contrary, APIC is actually the only source of the IRQ ack
> >> >> >> information. RTC hack would not work without APIC (or the
> >> >> >> bidirectional IRQ) passing this info to RTC.
> >> >> >>
> >> >> >> What APIC doesn't have now is the timer frequency or period info. This
> >> >> >> is known by RTC and also higher levels managing the clocks.
> >> >> >>
> >> >> > So APIC has one bit of information and RTC everything else.
> >> >>
> >> >> The information known by RTC (timer period) is also known by higher levels.
> >> >>
> >> > What do you mean by higher level here? vl.c or APIC.
> >>
> >> vl.c, qemu-timer.c.
> >>
> >> >> > The current
> >> >> > approach (and proposed patch) brings this one bit of information to RTC,
> >> >> > you are arguing that RTC should be able to communicate all its info to
> >> >> > APIC. Sorry I don't see that your way has any advantage. Just more
> >> >> > complex interface and it is much easier to get it wrong for other time
> >> >> > sources.
> >> >>
> >> >> I don't think anymore that APIC should be handling this but the
> >> >> generic stuff, like vl.c or exec.c. Then there would be only
> >> >> information passing from APIC to higher levels.
> >> > Handling reinjection by general timer code makes infinitely more sense
> >> > then handling it in APIC.
> >>
> >> I'm glad you agree, or did you mean 'less'?
> >>
> > Compared to APIC I would agree that even putting it in IDE is better idea :)
> >
> >> > One thing (from the top of my head) that can't
> >> > be implemented at that level is injection of interrupt back to back (i.e
> >> > injecting next interrupt immediately after guest acknowledge previous
> >> > one to RTC).
> >>
> >> But Jan told this confuses some buggy OSes.
> >>
> > You keep calling them buggy, but I don't agree. They are written with
> > certain assumption that are true on real HW, but hard to achieve on
> > virtual. Anyway we use this technique (back to back reinjection)
> > otherwise you can't solve drift problem if guest want to receive
> > timer interrupts with max frequency that host time source support.
> 
> Even if this confuses some OSes?
> 
We make it so that it will not confuse relevant OSes. It is not line you
have better choice.

> >> >
> >> >>
> >> >> >> I keep ignoring the idea that the current model, where both RTC and
> >> >> >> APIC must somehow work together to make coalescing work, is the only
> >> >> >> possible just because it is committed and it happens to work in some
> >> >> >> cases. It would be much better to concentrate this to one place, APIC
> >> >> >> or preferably higher level where it may benefit other timers too.
> >> >> >> Provided of course that the other models can be made to work.
> >> >> >>
> >> >> > So write the code and show us. You haven't show any evidence that RTC is
> >> >> > the wrong place. RTC knows when interrupt was acknowledge to RTC, it
> >> >> > know when clock frequency changes, it know when device reset happened.
> >> >> > APIC knows only that interrupt was coalesced. It doesn't even know that
> >> >> > it may be masked by a guest in IOAPIC (interrupts delivered while they
> >> >> > are masked not considered coalesced).
> >> >>
> >> >> Oh, I thought interrupt masking was the reason for coalescing! What
> >> >> exactly is the reason then?
> >> >>
> >> > The reason is that guest has no time to process previous interrupt
> >> > before it is time to inject next one.
> >>
> >> Because of other host load or other emulation done by the same QEMU
> >> process, I suppose?
> > Yes, both.
> >
> >>
> >> >> > Time source knows only when
> >> >> > frequency changes and may be when device reset happens if timer is
> >> >> > stopped by device on reset. So RTC is actually a sweet spot if you want
> >> >> > to minimize amount of info you need to pass between various layers.
> >> >> >
> >> >> >> >> Maybe that version would not bend backwards as much as the current to
> >> >> >> >> cater for buggy hosts.
> >> >> >> >>
> >> >> >> > You mean "buggy guests"?
> >> >> >>
> >> >> >> Yes, sorry.
> >> >> >>
> >> >> >> > What guests are not buggy in your opinion?
> >> >> >> > Linux tries hard to be smart and as a result the only way to have stable
> >> >> >> > clock with it is to go paravirt.
> >> >> >>
> >> >> >> I'm not an OS designer, but I think an OS should never crash, even if
> >> >> >> a burst of IRQs is received. Reprogramming the timer should consider
> >> >> >> the pending IRQ situation (0 or 1 with real HW). Those bugs are one
> >> >> >> cause of the problem.
> >> >> > OS should never crash in the absence of HW bugs? I doubt you can design
> >> >> > an OS that can run in a face of any HW failure. Anyway here we are
> >> >> > trying to solve guests time keeping problem not crashes. Do you think
> >> >> > you can design OS that can keep time accurately no matter how crazy all
> >> >> > HW clock behaves?
> >> >>
> >> >> I think my OS design skills are not relevant in this discussion, but
> >> >> IIRC there are fault tolerant operating systems for extreme conditions
> >> >> so it can be done.
> >> >>
> >> >> >
> >> >> >>
> >> >> >> >> > The fact is that timer device is not "just like any
> >> >> >> >> > other device" in virtual world. Any other device is easy: you just
> >> >> >> >> > implement spec as close as possible and everything works. For time
> >> >> >> >> > source device this is not enough. You can implement RTC+HPET to the
> >> >> >> >> > letter and your guest will drift like crazy.
> >> >> >> >>
> >> >> >> >> It's doable: a cycle accurate emulator will not cause any drift,
> >> >> >> >> without any voodoo. The interrupts would come after executing the same
> >> >> >> >> instruction as the real HW. For emulating any sufficiently buggy
> >> >> >> >> guests in any sufficiently desperate low resource conditions, this may
> >> >> >> >> be the only option that will always work.
> >> >> >> >>
> >> >> >> > Yes, but qemu and kvm are not cycle accurate emulators and don't strive
> >> >> >> > to be one. On the contrary KVM runs at native host CPU speed most of the
> >> >> >> > time, so any emulation done between two instruction is theoretically
> >> >> >> > noticeable for a guest. TSC is bypassed directly to a guest too, so
> >> >> >> > keeping all time source in perfect sync is also impossible.
> >> >> >>
> >> >> >> That is actually another cause of the problem. KVM gives the guest an
> >> >> >> illusion that the VCPU speed is equal to host speed. When they don't
> >> >> >> match, especially in critical code, there can be problems. It would be
> >> >> >> better to tell the guest a lower speed, which also can be guaranteed.
> >> >> >>
> >> >> > Not possible. It's that simple. You should take it into account in your
> >> >> > architecture design stage. In case of KVM real physical CPU executes guest
> >> >> > instruction and it does this as fast as it can. The only way we can hide
> >> >> > that from a guest is by intercepting each access to TSC and at that
> >> >> > point we can use bochs instead.
> >> >>
> >> >> Well, as Paul pointed out, there's also icount option.
> >> >>
> >> > icount is not an option for KVM.
> >>
> >> I think icount timer adjustment model might make sense for this work
> >> too. We'd then just need some figure of executed CPU instructions, TSC
> >> cycles or even kernel scheduler time slice information (how much time
> >> the process got).
> >>
> > And then? icount makes guest time flow dependant on amount of emulated
> > instructions. It relies on the fact that all time sources are
> > synchronized for a guest during emulation (including TSC). This is not
> > true for virtualization.
> 
> So for virtualization, is it OK then to keep time sources unsynchronized?
No, this is not. This is the reality that can't be changed for now. May
be when HW vitalization will introduce TSC scaling, but even then we
don't want to run 4 500Mhz guest on 2GHz host, we want to run 4 2GHz
guest on 2GHz host, so overcommit occurs and you can't guaranty that
no coalescing will happen. And I hope you are aware of the fact that
using icount introduce ~10% performance penalty on QEMU.

--
			Gleb.
Gleb Natapov - May 30, 2010, 6:02 a.m.
On Sat, May 29, 2010 at 09:21:14PM +0000, Blue Swirl wrote:
> On Sat, May 29, 2010 at 4:37 PM, Gleb Natapov <gleb@redhat.com> wrote:
> > On Sat, May 29, 2010 at 04:13:22PM +0000, Blue Swirl wrote:
> >> On Sat, May 29, 2010 at 2:46 PM, Gleb Natapov <gleb@redhat.com> wrote:
> >> > On Sat, May 29, 2010 at 09:35:30AM +0000, Blue Swirl wrote:
> >> >> > I still don't see how the alternative is supposed to simplify our life
> >> >> > or improve the efficiency of the de-coalescing workaround. It's rather
> >> >> > problematic like Gleb pointed out: The de-coalescing logic needs to be
> >> >> > informed about periodicity changes that can only be delivered along
> >> >> > IRQs. So what to do with the backlog when the timer is stopped?
> >> >>
> >> >> What happens with the current design? Gleb only mentioned the
> >> >> frequency change, I thought that was not so big problem. But I don't
> >> >> think this case should be allowed happen at all, it can't exist on
> >> >> real HW.
> >> >>
> >> > Hm, why it can't exist on real HW? Do simple exercise. Run WindowsXP
> >> > inside QEMU, connect with gdb to QEMU process and check what frequency
> >> > RTC configured with (hint: it will be 64Hz), now run video inside the
> >> > guest and check frequency again (hint: it will be 1Khz).
> >>
> >> You missed the key word 'stopped'. If the timer is really stopped, no
> >> IRQs should ever come out afterwards, just like on real HW. For the
> >> emulation, this means loss of ticks which should have been delivered
> >> before the change.
> >>
> > I haven't missed it. I describe to you reality of the situation. You want
> > to change reality to be more close to what you want it to be by adding
> > words to my description.
> 
> Quoting Jan: 'So what to do with the backlog when the timer is
> stopped?' I didn't add any words to your description, please be more
> careful with your attributions. Why do you think I want to change the
> reality?
Please refer to my words when you answer to my quote. You quoted my
answer to you statement:
 Gleb only mentioned the frequency change, I thought that was not so big
 problem. But I don't think this case should be allowed happen at all,
 it can't exist on real HW.
No 'stopped' was under discussion nowhere. FWIW 'stopped' is just a case
of frequency change.

> 
> XP frequency change isn't the same case as timer being stopped.
> 
And what is the big difference exactly?

> > Please just go write code, experiment, debug
> > and _then_ come here with design.
> 
> I added some debugging to RTC, PIC and APIC. I also built a small
> guest in x86 assembly to test the coalescing. However, in the tests
> with this guest and others I noticed that the coalescing only happens
> in some obscure conditions.
So try with real guest and with real load.

> 
> By default the APIC's delivery method for IRQs is ExtInt and
> coalescing counting happens only with Fixed. This means that the guest
> needs to reprogram APIC. It also looks like RTC interrupts need to be
> triggered. But I didn't see both of these to happen simultaneously in
> my tests with Linux and Windows guests. Of course, -rtc-td-hack flag
> must be used and I also disabled HPET to be sure that RTC would be
> used.
> 
> With DEBUG_COALESCING enabled, I just get increasing numbers for
> apic_irq_delivered:
> apic: apic_set_irq: coalescing 67123
> apic: apic_set_irq: coalescing 67124
> apic: apic_set_irq: coalescing 67125
So have you actually used -rtc-td-hack option? I compiled head of
qemu.git with DEBUG_COALESCING and run WindowsXP guest with -rtc-td-hack
and I get:
apic: apic_reset_irq_delivered: old coalescing 3
apic: apic_set_irq: coalescing 1
apic: apic_get_irq_delivered: returning coalescing 1
apic: apic_set_irq: coalescing 2
apic: apic_set_irq: coalescing 3
apic: apic_set_irq: coalescing 4
apic: apic_set_irq: coalescing 5
apic: apic_set_irq: coalescing 6
apic: apic_reset_irq_delivered: old coalescing 6
apic: apic_set_irq: coalescing 1
apic: apic_get_irq_delivered: returning coalescing 1

> 
> If the hack were active, the numbers would be close to zero (or at
> least some point) because apic_reset_irq_delivered would be called,
> but this does not happen. Could you specify a clear test case with
> which the coalescing action could be tested? Linux or BSD based,
> please.
Linux don't use RTC as time source and I don't know about BSD, so no
Linux or BSD test case for you, sorry. Run WindowXP standard HAL and put
heavy load on the host. You can run video inside the gust to trigger
coalescing more easily.

> 
> >> But what if the guest changed the frequency very often, and between
> >> changes used zero value, like 64Hz -> 0Hz -> 128Hz -> 0Hz -> 64Hz?
> > Too bad, the world is not perfect.
> >
> > --
> >                        Gleb.
> >

--
			Gleb.
Blue Swirl - May 30, 2010, 11:41 a.m.
2010/5/30 Gleb Natapov <gleb@redhat.com>:
> On Sat, May 29, 2010 at 08:52:34PM +0000, Blue Swirl wrote:
>> On Sat, May 29, 2010 at 4:32 PM, Gleb Natapov <gleb@redhat.com> wrote:
>> > On Sat, May 29, 2010 at 04:03:22PM +0000, Blue Swirl wrote:
>> >> 2010/5/29 Gleb Natapov <gleb@redhat.com>:
>> >> > On Sat, May 29, 2010 at 09:15:11AM +0000, Blue Swirl wrote:
>> >> >> >> There is no code, because we're still at architecture design stage.
>> >> >> >>
>> >> >> > Try to write test code to understand the problem better.
>> >> >>
>> >> >> I will.
>> >> >>
>> >> > Please do ASAP. This discussion shows that you don't understand what is the
>> >> > problem that we are dialing with.
>> >>
>> >> Which part of the problem you think I don't understand?
>> >>
>> > It seams to me you don't understand how Windows uses RTC for time
>> > keeping and how the QEMU solves the problem today.
>>
>> RTC causes periodic interrupts and Windows interrupt handler
>> increments jiffies, like Linux?
>>
> Linux does much more complicated things than that to keep time, so the
> only way to fix time drift in Linux was to introduce pvclock. For Window
> it is not so accurate too, since Windows can change clock frequency any
> time it can't calculate time from jiffies, it needs to update clock at
> each time tick.
>
>> >> >> >> >> >> guests could also be assisted with special handling (like win2k
>> >> >> >> >> >> install hack), for example guest instructions could be counted
>> >> >> >> >> >> (approximately, for example using TB size or TSC) and only inject
>> >> >> >> >> >> after at least N instructions have passed.
>> >> >> >> >> > Guest instructions cannot be easily counted in KVM (it can be done more
>> >> >> >> >> > or less reliably using perf counters, may be).
>> >> >> >> >>
>> >> >> >> >> Aren't there any debug registers or perf counters, which can generate
>> >> >> >> >> an interrupt after some number of instructions have been executed?
>> >> >> >> > Don't think debug registers have something like that and they are
>> >> >> >> > available for guest use anyway. Perf counters differs greatly from CPU
>> >> >> >> > to CPU (even between two CPUs of the same manufacturer), and we want to
>> >> >> >> > keep using them for profiling guests. And I don't see what problem it
>> >> >> >> > will solve anyway that can be solved by simple delay between irq
>> >> >> >> > reinjection.
>> >> >> >>
>> >> >> >> This would allow counting the executed instructions and limit it. Thus
>> >> >> >> we could emulate a 500MHz CPU on a 2GHz CPU more accurately.
>> >> >> >>
>> >> >> > Why would you want to limit number of instruction executed by guest if
>> >> >> > CPU has nothing else to do anyway? The problem occurs not when we have
>> >> >> > spare cycles so give to a guest, but in opposite case.
>> >> >>
>> >> >> I think one problem is that the guest has executed too much compared
>> >> >> to what would happen with real HW with a lesser CPU. That explains the
>> >> >> RTC frequency reprogramming case.
>> >> > You think wrong. The problem is exactly opposite: the guest haven't
>> >> > had enough execution time between two time interrupts. I don't know what
>> >> > RTC frequency reprogramming case you are talking about here.
>> >>
>> >> The case you told me where N pending tick IRQs exist but the guest
>> >> wants to change the RTC frequency from 64Hz to 1024Hz.
>> >>
>> >> Let's make this more concrete. 1 GHz CPU, initially 100Hz RTC, so
>> >> 10Mcycles/tick or 10ms/tick. At T = 30Mcycles, guest wants to change
>> >> the frequency to 1000Hz.
>> >>
>> >> The problem for emulation is that for the same 3 ticks, there has been
>> >> so little execution power that the ticks have been coalesced. But
>> >> isn't the guest cycle count then much lower than 30Mcyc?
>> >>
>> >> Isn't it so that the guest must be above 30Mcyc to be able to want the
>> >> change? But if we reach that point,  the problem must have not been
>> >> too little execution time, but too much.
>> >>
>> > Sorry I tried hard to understand what have you said above but failed.
>> > What do you mean "to be able to want the change"? Guest sometimes wants
>> > to get 64 timer interrupts per second and sometimes it wants to get 1024
>> > timer interrupt per second. It wants it not as a result of time drift or
>> > anything. It's just how guest behaves. You seams to be to fixated on
>> > guest frequency change. It's just something you have to take into
>> > account when you reinject interrupts.
>>
>> I meant that in the scenario, the guest won't change the RTC before
>> 30Mcyc because of some built in determinism in the guest. At that
>> point, because of some reason, the change would happen.
>>
> I still don't understand what are you trying to say here. Guest changes
> frequency because of some even in the guest. It is totally independent
> of what happens in QEMUs RTC emulation.

I'm trying to understand the order of events. In the scenario, the
order of events on real HW would be:
10Mcyc: tick IRQ 1
20Mcyc: tick IRQ 2
30Mcyc: tick IRQ 3
30Mcyc: reprogram timer
31Mcyc: tick IRQ 4
32Mcyc: tick IRQ 5
33Mcyc: tick IRQ 6
34Mcyc: tick IRQ 7

With QEMU, the order could become:
30Mcyc: reprogram timer
30.5Mcyc: tick IRQ 1
31Mcyc: tick IRQ 2
31.5Mcyc: tick IRQ 3
32Mcyc: tick IRQ 4
32.5Mcyc: tick IRQ 5
33Mcyc: tick IRQ 6
34Mcyc: tick IRQ 7

Correct?

>
>> >
>> >> >
>> >> >>
>> >> >> >
>> >> >> >> >>
>> >> >> >> >> >>
>> >> >> >> >> >> > And even if the rate did not matter, the APIC woult still have to now
>> >> >> >> >> >> > about the fact that an IRQ is really periodic and does not only appear
>> >> >> >> >> >> > as such for a certain interval. This really does not sound like
>> >> >> >> >> >> > simplifying things or even make them cleaner.
>> >> >> >> >> >>
>> >> >> >> >> >> It would, the voodoo would be contained only in APIC, RTC would be
>> >> >> >> >> >> just like any other device. With the bidirectional irqs, this voodoo
>> >> >> >> >> >> would probably eventually spread to many other devices. The logical
>> >> >> >> >> >> conclusion of that would be a system where all devices would be
>> >> >> >> >> >> careful not to disturb the guest at wrong moment because that would
>> >> >> >> >> >> trigger a bug.
>> >> >> >> >> >>
>> >> >> >> >> > This voodoo will be so complex and unreliable that it will make RTC hack
>> >> >> >> >> > pale in comparison (and I still don't see how you are going to make it
>> >> >> >> >> > actually work).
>> >> >> >> >>
>> >> >> >> >> Implement everything inside APIC: only coalescing and reinjection.
>> >> >> >> > APIC has zero info needed to implement reinjection correctly as was
>> >> >> >> > shown to you several time in this thread and you simply keep ignoring
>> >> >> >> > it.
>> >> >> >>
>> >> >> >> On the contrary, APIC is actually the only source of the IRQ ack
>> >> >> >> information. RTC hack would not work without APIC (or the
>> >> >> >> bidirectional IRQ) passing this info to RTC.
>> >> >> >>
>> >> >> >> What APIC doesn't have now is the timer frequency or period info. This
>> >> >> >> is known by RTC and also higher levels managing the clocks.
>> >> >> >>
>> >> >> > So APIC has one bit of information and RTC everything else.
>> >> >>
>> >> >> The information known by RTC (timer period) is also known by higher levels.
>> >> >>
>> >> > What do you mean by higher level here? vl.c or APIC.
>> >>
>> >> vl.c, qemu-timer.c.
>> >>
>> >> >> > The current
>> >> >> > approach (and proposed patch) brings this one bit of information to RTC,
>> >> >> > you are arguing that RTC should be able to communicate all its info to
>> >> >> > APIC. Sorry I don't see that your way has any advantage. Just more
>> >> >> > complex interface and it is much easier to get it wrong for other time
>> >> >> > sources.
>> >> >>
>> >> >> I don't think anymore that APIC should be handling this but the
>> >> >> generic stuff, like vl.c or exec.c. Then there would be only
>> >> >> information passing from APIC to higher levels.
>> >> > Handling reinjection by general timer code makes infinitely more sense
>> >> > then handling it in APIC.
>> >>
>> >> I'm glad you agree, or did you mean 'less'?
>> >>
>> > Compared to APIC I would agree that even putting it in IDE is better idea :)
>> >
>> >> > One thing (from the top of my head) that can't
>> >> > be implemented at that level is injection of interrupt back to back (i.e
>> >> > injecting next interrupt immediately after guest acknowledge previous
>> >> > one to RTC).
>> >>
>> >> But Jan told this confuses some buggy OSes.
>> >>
>> > You keep calling them buggy, but I don't agree. They are written with
>> > certain assumption that are true on real HW, but hard to achieve on
>> > virtual. Anyway we use this technique (back to back reinjection)
>> > otherwise you can't solve drift problem if guest want to receive
>> > timer interrupts with max frequency that host time source support.
>>
>> Even if this confuses some OSes?
>>
> We make it so that it will not confuse relevant OSes. It is not line you
> have better choice.
>
>> >> >
>> >> >>
>> >> >> >> I keep ignoring the idea that the current model, where both RTC and
>> >> >> >> APIC must somehow work together to make coalescing work, is the only
>> >> >> >> possible just because it is committed and it happens to work in some
>> >> >> >> cases. It would be much better to concentrate this to one place, APIC
>> >> >> >> or preferably higher level where it may benefit other timers too.
>> >> >> >> Provided of course that the other models can be made to work.
>> >> >> >>
>> >> >> > So write the code and show us. You haven't show any evidence that RTC is
>> >> >> > the wrong place. RTC knows when interrupt was acknowledge to RTC, it
>> >> >> > know when clock frequency changes, it know when device reset happened.
>> >> >> > APIC knows only that interrupt was coalesced. It doesn't even know that
>> >> >> > it may be masked by a guest in IOAPIC (interrupts delivered while they
>> >> >> > are masked not considered coalesced).
>> >> >>
>> >> >> Oh, I thought interrupt masking was the reason for coalescing! What
>> >> >> exactly is the reason then?
>> >> >>
>> >> > The reason is that guest has no time to process previous interrupt
>> >> > before it is time to inject next one.
>> >>
>> >> Because of other host load or other emulation done by the same QEMU
>> >> process, I suppose?
>> > Yes, both.
>> >
>> >>
>> >> >> > Time source knows only when
>> >> >> > frequency changes and may be when device reset happens if timer is
>> >> >> > stopped by device on reset. So RTC is actually a sweet spot if you want
>> >> >> > to minimize amount of info you need to pass between various layers.
>> >> >> >
>> >> >> >> >> Maybe that version would not bend backwards as much as the current to
>> >> >> >> >> cater for buggy hosts.
>> >> >> >> >>
>> >> >> >> > You mean "buggy guests"?
>> >> >> >>
>> >> >> >> Yes, sorry.
>> >> >> >>
>> >> >> >> > What guests are not buggy in your opinion?
>> >> >> >> > Linux tries hard to be smart and as a result the only way to have stable
>> >> >> >> > clock with it is to go paravirt.
>> >> >> >>
>> >> >> >> I'm not an OS designer, but I think an OS should never crash, even if
>> >> >> >> a burst of IRQs is received. Reprogramming the timer should consider
>> >> >> >> the pending IRQ situation (0 or 1 with real HW). Those bugs are one
>> >> >> >> cause of the problem.
>> >> >> > OS should never crash in the absence of HW bugs? I doubt you can design
>> >> >> > an OS that can run in a face of any HW failure. Anyway here we are
>> >> >> > trying to solve guests time keeping problem not crashes. Do you think
>> >> >> > you can design OS that can keep time accurately no matter how crazy all
>> >> >> > HW clock behaves?
>> >> >>
>> >> >> I think my OS design skills are not relevant in this discussion, but
>> >> >> IIRC there are fault tolerant operating systems for extreme conditions
>> >> >> so it can be done.
>> >> >>
>> >> >> >
>> >> >> >>
>> >> >> >> >> > The fact is that timer device is not "just like any
>> >> >> >> >> > other device" in virtual world. Any other device is easy: you just
>> >> >> >> >> > implement spec as close as possible and everything works. For time
>> >> >> >> >> > source device this is not enough. You can implement RTC+HPET to the
>> >> >> >> >> > letter and your guest will drift like crazy.
>> >> >> >> >>
>> >> >> >> >> It's doable: a cycle accurate emulator will not cause any drift,
>> >> >> >> >> without any voodoo. The interrupts would come after executing the same
>> >> >> >> >> instruction as the real HW. For emulating any sufficiently buggy
>> >> >> >> >> guests in any sufficiently desperate low resource conditions, this may
>> >> >> >> >> be the only option that will always work.
>> >> >> >> >>
>> >> >> >> > Yes, but qemu and kvm are not cycle accurate emulators and don't strive
>> >> >> >> > to be one. On the contrary KVM runs at native host CPU speed most of the
>> >> >> >> > time, so any emulation done between two instruction is theoretically
>> >> >> >> > noticeable for a guest. TSC is bypassed directly to a guest too, so
>> >> >> >> > keeping all time source in perfect sync is also impossible.
>> >> >> >>
>> >> >> >> That is actually another cause of the problem. KVM gives the guest an
>> >> >> >> illusion that the VCPU speed is equal to host speed. When they don't
>> >> >> >> match, especially in critical code, there can be problems. It would be
>> >> >> >> better to tell the guest a lower speed, which also can be guaranteed.
>> >> >> >>
>> >> >> > Not possible. It's that simple. You should take it into account in your
>> >> >> > architecture design stage. In case of KVM real physical CPU executes guest
>> >> >> > instruction and it does this as fast as it can. The only way we can hide
>> >> >> > that from a guest is by intercepting each access to TSC and at that
>> >> >> > point we can use bochs instead.
>> >> >>
>> >> >> Well, as Paul pointed out, there's also icount option.
>> >> >>
>> >> > icount is not an option for KVM.
>> >>
>> >> I think icount timer adjustment model might make sense for this work
>> >> too. We'd then just need some figure of executed CPU instructions, TSC
>> >> cycles or even kernel scheduler time slice information (how much time
>> >> the process got).
>> >>
>> > And then? icount makes guest time flow dependant on amount of emulated
>> > instructions. It relies on the fact that all time sources are
>> > synchronized for a guest during emulation (including TSC). This is not
>> > true for virtualization.
>>
>> So for virtualization, is it OK then to keep time sources unsynchronized?
> No, this is not. This is the reality that can't be changed for now. May
> be when HW vitalization will introduce TSC scaling, but even then we
> don't want to run 4 500Mhz guest on 2GHz host, we want to run 4 2GHz
> guest on 2GHz host, so overcommit occurs and you can't guaranty that
> no coalescing will happen. And I hope you are aware of the fact that
> using icount introduce ~10% performance penalty on QEMU.
>
> --
>                        Gleb.
>
Gleb Natapov - May 30, 2010, 11:52 a.m.
On Sun, May 30, 2010 at 11:41:36AM +0000, Blue Swirl wrote:
> >> I meant that in the scenario, the guest won't change the RTC before
> >> 30Mcyc because of some built in determinism in the guest. At that
> >> point, because of some reason, the change would happen.
> >>
> > I still don't understand what are you trying to say here. Guest changes
> > frequency because of some even in the guest. It is totally independent
> > of what happens in QEMUs RTC emulation.
> 
> I'm trying to understand the order of events. In the scenario, the
> order of events on real HW would be:
> 10Mcyc: tick IRQ 1
> 20Mcyc: tick IRQ 2
> 30Mcyc: tick IRQ 3
> 30Mcyc: reprogram timer
> 31Mcyc: tick IRQ 4
> 32Mcyc: tick IRQ 5
> 33Mcyc: tick IRQ 6
> 34Mcyc: tick IRQ 7
> 
> With QEMU, the order could become:
> 30Mcyc: reprogram timer
> 30.5Mcyc: tick IRQ 1
> 31Mcyc: tick IRQ 2
> 31.5Mcyc: tick IRQ 3
> 32Mcyc: tick IRQ 4
> 32.5Mcyc: tick IRQ 5
> 33Mcyc: tick IRQ 6
> 34Mcyc: tick IRQ 7
> 
> Correct?
Not sure, your description is not complete. Let me try:
On real HW:
10Mcyc: tick IRQ 1 -> delivered to an OS
20Mcyc: tick IRQ 2 -> delivered to an OS
30Mcyc: tick IRQ 3 -> delivered to an OS
30Mcyc: reprogram timer
31Mcyc: tick IRQ 4 -> delivered to an OS
32Mcyc: tick IRQ 5 -> delivered to an OS
33Mcyc: tick IRQ 6 -> delivered to an OS
34Mcyc: tick IRQ 7 -> delivered to an OS

With QEMU:
10Mcyc: tick IRQ 1 -> coalesced
20Mcyc: tick IRQ 2 -> coalesced
30Mcyc: tick IRQ 3 -> coalesced
30Mcyc: reprogram timer
30.1Mcyc: tick IRQ 1 -> delivered to an OS
30.2Mcyc: tick IRQ 2 -> delivered to an OS
30.3Mcyc: tick IRQ 3 -> delivered to an OS
31Mcyc: tick IRQ 4 -> delivered to an OS
32Mcyc: tick IRQ 5 -> delivered to an OS
33Mcyc: tick IRQ 6 -> delivered to an OS
34Mcyc: tick IRQ 7 -> delivered to an OS



--
			Gleb.
Avi Kivity - May 30, 2010, noon
On 05/28/2010 01:19 AM, Jan Kiszka wrote:
>
> Still, this does not answer:
>
> - How do you want to detect lost timer ticks?
>
>    

Your (and Gleb's) approach: during injection
An alternative: during guest ack

The normal pattern is inj/ack/inj/ack; if we see inj/inj/inj we know the 
guest isn't keeping up.

> - What subsystem(s) keeps track of the backlog?
>    

Clearly, the clock chip (hpet/rtc/pit).

With the alternative approach, the clock emulation requests that acks be 
reported via a qemu_irq interface.

> - And depending on the above: How to detect at all that a specific IRQ
>    is a timer tick?
>    

Clearly, a blocker unless the clock is responsible for timekeeping.
Avi Kivity - May 30, 2010, 12:05 p.m.
On 05/29/2010 12:15 PM, Blue Swirl wrote:
>
>>> This would allow counting the executed instructions and limit it. Thus
>>> we could emulate a 500MHz CPU on a 2GHz CPU more accurately.
>>>
>>>        
>> Why would you want to limit number of instruction executed by guest if
>> CPU has nothing else to do anyway? The problem occurs not when we have
>> spare cycles so give to a guest, but in opposite case.
>>      
> I think one problem is that the guest has executed too much compared
> to what would happen with real HW with a lesser CPU. That explains the
> RTC frequency reprogramming case.
>    

The root cause is that while qemu is scheduled out, the real time clock 
keeps ticking.  Since we can't stop real time, we must compensate in 
other ways.

>> So write the code and show us. You haven't show any evidence that RTC is
>> the wrong place. RTC knows when interrupt was acknowledge to RTC, it
>> know when clock frequency changes, it know when device reset happened.
>> APIC knows only that interrupt was coalesced. It doesn't even know that
>> it may be masked by a guest in IOAPIC (interrupts delivered while they
>> are masked not considered coalesced).
>>      
> Oh, I thought interrupt masking was the reason for coalescing! What
> exactly is the reason then?
>    

The above.
Blue Swirl - May 30, 2010, 12:10 p.m.
2010/5/30 Gleb Natapov <gleb@redhat.com>:
> On Sat, May 29, 2010 at 09:21:14PM +0000, Blue Swirl wrote:
>> On Sat, May 29, 2010 at 4:37 PM, Gleb Natapov <gleb@redhat.com> wrote:
>> > On Sat, May 29, 2010 at 04:13:22PM +0000, Blue Swirl wrote:
>> >> On Sat, May 29, 2010 at 2:46 PM, Gleb Natapov <gleb@redhat.com> wrote:
>> >> > On Sat, May 29, 2010 at 09:35:30AM +0000, Blue Swirl wrote:
>> >> >> > I still don't see how the alternative is supposed to simplify our life
>> >> >> > or improve the efficiency of the de-coalescing workaround. It's rather
>> >> >> > problematic like Gleb pointed out: The de-coalescing logic needs to be
>> >> >> > informed about periodicity changes that can only be delivered along
>> >> >> > IRQs. So what to do with the backlog when the timer is stopped?
>> >> >>
>> >> >> What happens with the current design? Gleb only mentioned the
>> >> >> frequency change, I thought that was not so big problem. But I don't
>> >> >> think this case should be allowed happen at all, it can't exist on
>> >> >> real HW.
>> >> >>
>> >> > Hm, why it can't exist on real HW? Do simple exercise. Run WindowsXP
>> >> > inside QEMU, connect with gdb to QEMU process and check what frequency
>> >> > RTC configured with (hint: it will be 64Hz), now run video inside the
>> >> > guest and check frequency again (hint: it will be 1Khz).
>> >>
>> >> You missed the key word 'stopped'. If the timer is really stopped, no
>> >> IRQs should ever come out afterwards, just like on real HW. For the
>> >> emulation, this means loss of ticks which should have been delivered
>> >> before the change.
>> >>
>> > I haven't missed it. I describe to you reality of the situation. You want
>> > to change reality to be more close to what you want it to be by adding
>> > words to my description.
>>
>> Quoting Jan: 'So what to do with the backlog when the timer is
>> stopped?' I didn't add any words to your description, please be more
>> careful with your attributions. Why do you think I want to change the
>> reality?
> Please refer to my words when you answer to my quote. You quoted my
> answer to you statement:
>  Gleb only mentioned the frequency change, I thought that was not so big
>  problem. But I don't think this case should be allowed happen at all,
>  it can't exist on real HW.

With 'this case' I was referring to 'case with timer stopped', not
'case which Gleb mentioned'.

> No 'stopped' was under discussion nowhere.

It's clearly written there in the sentence Jan wrote.

> FWIW 'stopped' is just a case
> of frequency change.

True.

>
>>
>> XP frequency change isn't the same case as timer being stopped.
>>
> And what is the big difference exactly?

Because after the timer is stopped, its extremely unrealistic to send
any IRQs. Whereas if the frequency is changed to some other nonzero
value, we can cheat and inject some more queued IRQs.

Anyway, if this case is not interesting because it doesn't happen in
real life emulation scenarios, we can forget it no matter how buggy
the current QEMU implementation is.

>> > Please just go write code, experiment, debug
>> > and _then_ come here with design.
>>
>> I added some debugging to RTC, PIC and APIC. I also built a small
>> guest in x86 assembly to test the coalescing. However, in the tests
>> with this guest and others I noticed that the coalescing only happens
>> in some obscure conditions.
> So try with real guest and with real load.

Well, I'd like to get the test program also trigger it. Now I'm getting:
apic: write: 00000350 = 00000000
apic: apic_reset_irq_delivered: old coalescing 0
apic: apic_local_deliver: vector 3 delivery mode 0
apic: apic_set_irq: coalescing 1
apic: apic_get_irq_delivered: returning coalescing 1
apic: apic_reset_irq_delivered: old coalescing 1
apic: apic_local_deliver: vector 3 delivery mode 0
apic: apic_set_irq: coalescing 0
apic: apic_get_irq_delivered: returning coalescing 0
apic: apic_reset_irq_delivered: old coalescing 0
apic: apic_local_deliver: vector 3 delivery mode 0
apic: apic_set_irq: coalescing 0

It looks like some other IRQs cause the coalescing, because also
looking at RTC code, it seems it's not possible for RTC to raise the
IRQ (except update IRQ, alarm etc.) without calling
apic_reset_irq_delivered().

I've attached my test program. Compile:
gcc -m32 -o coalescing coalescing.S -ffreestanding -nostdlib -Wl,-T
coalescing.ld -g && objcopy -Obinary coalescing coalescing.bin

Run:
qemu -L . -bios coalescing.bin -no-hpet -rtc-td-hack

>>
>> By default the APIC's delivery method for IRQs is ExtInt and
>> coalescing counting happens only with Fixed. This means that the guest
>> needs to reprogram APIC. It also looks like RTC interrupts need to be
>> triggered. But I didn't see both of these to happen simultaneously in
>> my tests with Linux and Windows guests. Of course, -rtc-td-hack flag
>> must be used and I also disabled HPET to be sure that RTC would be
>> used.
>>
>> With DEBUG_COALESCING enabled, I just get increasing numbers for
>> apic_irq_delivered:
>> apic: apic_set_irq: coalescing 67123
>> apic: apic_set_irq: coalescing 67124
>> apic: apic_set_irq: coalescing 67125
> So have you actually used -rtc-td-hack option? I compiled head of
> qemu.git with DEBUG_COALESCING and run WindowsXP guest with -rtc-td-hack
> and I get:
> apic: apic_reset_irq_delivered: old coalescing 3
> apic: apic_set_irq: coalescing 1
> apic: apic_get_irq_delivered: returning coalescing 1
> apic: apic_set_irq: coalescing 2
> apic: apic_set_irq: coalescing 3
> apic: apic_set_irq: coalescing 4
> apic: apic_set_irq: coalescing 5
> apic: apic_set_irq: coalescing 6
> apic: apic_reset_irq_delivered: old coalescing 6
> apic: apic_set_irq: coalescing 1
> apic: apic_get_irq_delivered: returning coalescing 1
>
>>
>> If the hack were active, the numbers would be close to zero (or at
>> least some point) because apic_reset_irq_delivered would be called,
>> but this does not happen. Could you specify a clear test case with
>> which the coalescing action could be tested? Linux or BSD based,
>> please.
> Linux don't use RTC as time source and I don't know about BSD, so no
> Linux or BSD test case for you, sorry. Run WindowXP standard HAL and put
> heavy load on the host. You can run video inside the gust to trigger
> coalescing more easily.

I don't have Windows XP, sorry.

>
>>
>> >> But what if the guest changed the frequency very often, and between
>> >> changes used zero value, like 64Hz -> 0Hz -> 128Hz -> 0Hz -> 64Hz?
>> > Too bad, the world is not perfect.
>> >
>> > --
>> >                        Gleb.
>> >
>
> --
>                        Gleb.
>
Jan Kiszka - May 30, 2010, 12:24 p.m.
Blue Swirl wrote:
>> Linux don't use RTC as time source and I don't know about BSD, so no
>> Linux or BSD test case for you, sorry. Run WindowXP standard HAL and put
>> heavy load on the host. You can run video inside the gust to trigger
>> coalescing more easily.
> 
> I don't have Windows XP, sorry.

ReactOS [1], at least its 32-bit version, appears to use the RTC as well.

Jan

[1] http://www.reactos.org/de/download.html
Gleb Natapov - May 30, 2010, 12:33 p.m.
On Sun, May 30, 2010 at 12:10:16PM +0000, Blue Swirl wrote:
> >> >> You missed the key word 'stopped'. If the timer is really stopped, no
> >> >> IRQs should ever come out afterwards, just like on real HW. For the
> >> >> emulation, this means loss of ticks which should have been delivered
> >> >> before the change.
> >> >>
> >> > I haven't missed it. I describe to you reality of the situation. You want
> >> > to change reality to be more close to what you want it to be by adding
> >> > words to my description.
> >>
> >> Quoting Jan: 'So what to do with the backlog when the timer is
> >> stopped?' I didn't add any words to your description, please be more
> >> careful with your attributions. Why do you think I want to change the
> >> reality?
> > Please refer to my words when you answer to my quote. You quoted my
> > answer to you statement:
> >  Gleb only mentioned the frequency change, I thought that was not so big
> >  problem. But I don't think this case should be allowed happen at all,
> >  it can't exist on real HW.
> 
> With 'this case' I was referring to 'case with timer stopped', not
> 'case which Gleb mentioned'.
> 
> > No 'stopped' was under discussion nowhere.
> 
> It's clearly written there in the sentence Jan wrote.
> 
Jan, not me, but lets leave this topic alone since you agree that
stopped is just a case of frequency change anyway.

> > FWIW 'stopped' is just a case
> > of frequency change.
> 
> True.
> 
> >
> >>
> >> XP frequency change isn't the same case as timer being stopped.
> >>
> > And what is the big difference exactly?
> 
> Because after the timer is stopped, its extremely unrealistic to send
> any IRQs. Whereas if the frequency is changed to some other nonzero
> value, we can cheat and inject some more queued IRQs.
> 
Correct, when gets disables clock source (by reset or any other means)
coalesced backlog should be forgotten.

> Anyway, if this case is not interesting because it doesn't happen in
> real life emulation scenarios, we can forget it no matter how buggy
> the current QEMU implementation is.
> 
> >> > Please just go write code, experiment, debug
> >> > and _then_ come here with design.
> >>
> >> I added some debugging to RTC, PIC and APIC. I also built a small
> >> guest in x86 assembly to test the coalescing. However, in the tests
> >> with this guest and others I noticed that the coalescing only happens
> >> in some obscure conditions.
> > So try with real guest and with real load.
> 
> Well, I'd like to get the test program also trigger it. Now I'm getting:
> apic: write: 00000350 = 00000000
> apic: apic_reset_irq_delivered: old coalescing 0
> apic: apic_local_deliver: vector 3 delivery mode 0
> apic: apic_set_irq: coalescing 1
> apic: apic_get_irq_delivered: returning coalescing 1
> apic: apic_reset_irq_delivered: old coalescing 1
> apic: apic_local_deliver: vector 3 delivery mode 0
> apic: apic_set_irq: coalescing 0
> apic: apic_get_irq_delivered: returning coalescing 0
> apic: apic_reset_irq_delivered: old coalescing 0
> apic: apic_local_deliver: vector 3 delivery mode 0
> apic: apic_set_irq: coalescing 0
> 
> It looks like some other IRQs cause the coalescing, because also
> looking at RTC code, it seems it's not possible for RTC to raise the
> IRQ (except update IRQ, alarm etc.) without calling
> apic_reset_irq_delivered().
> 
> I've attached my test program. Compile:
> gcc -m32 -o coalescing coalescing.S -ffreestanding -nostdlib -Wl,-T
> coalescing.ld -g && objcopy -Obinary coalescing coalescing.bin
> 
> Run:
> qemu -L . -bios coalescing.bin -no-hpet -rtc-td-hack
> 
The application does not work for me. Looks like it fails to enter
protected mode. $pc jumps from 0x00000000fffffff0 to 0x00000000000f003e
and back.

> >>
> >> By default the APIC's delivery method for IRQs is ExtInt and
> >> coalescing counting happens only with Fixed. This means that the guest
> >> needs to reprogram APIC. It also looks like RTC interrupts need to be
> >> triggered. But I didn't see both of these to happen simultaneously in
> >> my tests with Linux and Windows guests. Of course, -rtc-td-hack flag
> >> must be used and I also disabled HPET to be sure that RTC would be
> >> used.
> >>
> >> With DEBUG_COALESCING enabled, I just get increasing numbers for
> >> apic_irq_delivered:
> >> apic: apic_set_irq: coalescing 67123
> >> apic: apic_set_irq: coalescing 67124
> >> apic: apic_set_irq: coalescing 67125
> > So have you actually used -rtc-td-hack option? I compiled head of
> > qemu.git with DEBUG_COALESCING and run WindowsXP guest with -rtc-td-hack
> > and I get:
> > apic: apic_reset_irq_delivered: old coalescing 3
> > apic: apic_set_irq: coalescing 1
> > apic: apic_get_irq_delivered: returning coalescing 1
> > apic: apic_set_irq: coalescing 2
> > apic: apic_set_irq: coalescing 3
> > apic: apic_set_irq: coalescing 4
> > apic: apic_set_irq: coalescing 5
> > apic: apic_set_irq: coalescing 6
> > apic: apic_reset_irq_delivered: old coalescing 6
> > apic: apic_set_irq: coalescing 1
> > apic: apic_get_irq_delivered: returning coalescing 1
> >
> >>
> >> If the hack were active, the numbers would be close to zero (or at
> >> least some point) because apic_reset_irq_delivered would be called,
> >> but this does not happen. Could you specify a clear test case with
> >> which the coalescing action could be tested? Linux or BSD based,
> >> please.
> > Linux don't use RTC as time source and I don't know about BSD, so no
> > Linux or BSD test case for you, sorry. Run WindowXP standard HAL and put
> > heavy load on the host. You can run video inside the gust to trigger
> > coalescing more easily.
> 
> I don't have Windows XP, sorry.
> 
Will be hard to debug Windows time drift without Windows ;) Do you know
what time source BSD uses?

--
			Gleb.
Blue Swirl - May 30, 2010, 12:56 p.m.
2010/5/30 Gleb Natapov <gleb@redhat.com>:
> On Sun, May 30, 2010 at 12:10:16PM +0000, Blue Swirl wrote:
>> >> >> You missed the key word 'stopped'. If the timer is really stopped, no
>> >> >> IRQs should ever come out afterwards, just like on real HW. For the
>> >> >> emulation, this means loss of ticks which should have been delivered
>> >> >> before the change.
>> >> >>
>> >> > I haven't missed it. I describe to you reality of the situation. You want
>> >> > to change reality to be more close to what you want it to be by adding
>> >> > words to my description.
>> >>
>> >> Quoting Jan: 'So what to do with the backlog when the timer is
>> >> stopped?' I didn't add any words to your description, please be more
>> >> careful with your attributions. Why do you think I want to change the
>> >> reality?
>> > Please refer to my words when you answer to my quote. You quoted my
>> > answer to you statement:
>> >  Gleb only mentioned the frequency change, I thought that was not so big
>> >  problem. But I don't think this case should be allowed happen at all,
>> >  it can't exist on real HW.
>>
>> With 'this case' I was referring to 'case with timer stopped', not
>> 'case which Gleb mentioned'.
>>
>> > No 'stopped' was under discussion nowhere.
>>
>> It's clearly written there in the sentence Jan wrote.
>>
> Jan, not me, but lets leave this topic alone since you agree that
> stopped is just a case of frequency change anyway.
>
>> > FWIW 'stopped' is just a case
>> > of frequency change.
>>
>> True.
>>
>> >
>> >>
>> >> XP frequency change isn't the same case as timer being stopped.
>> >>
>> > And what is the big difference exactly?
>>
>> Because after the timer is stopped, its extremely unrealistic to send
>> any IRQs. Whereas if the frequency is changed to some other nonzero
>> value, we can cheat and inject some more queued IRQs.
>>
> Correct, when gets disables clock source (by reset or any other means)
> coalesced backlog should be forgotten.
>
>> Anyway, if this case is not interesting because it doesn't happen in
>> real life emulation scenarios, we can forget it no matter how buggy
>> the current QEMU implementation is.
>>
>> >> > Please just go write code, experiment, debug
>> >> > and _then_ come here with design.
>> >>
>> >> I added some debugging to RTC, PIC and APIC. I also built a small
>> >> guest in x86 assembly to test the coalescing. However, in the tests
>> >> with this guest and others I noticed that the coalescing only happens
>> >> in some obscure conditions.
>> > So try with real guest and with real load.
>>
>> Well, I'd like to get the test program also trigger it. Now I'm getting:
>> apic: write: 00000350 = 00000000
>> apic: apic_reset_irq_delivered: old coalescing 0
>> apic: apic_local_deliver: vector 3 delivery mode 0
>> apic: apic_set_irq: coalescing 1
>> apic: apic_get_irq_delivered: returning coalescing 1
>> apic: apic_reset_irq_delivered: old coalescing 1
>> apic: apic_local_deliver: vector 3 delivery mode 0
>> apic: apic_set_irq: coalescing 0
>> apic: apic_get_irq_delivered: returning coalescing 0
>> apic: apic_reset_irq_delivered: old coalescing 0
>> apic: apic_local_deliver: vector 3 delivery mode 0
>> apic: apic_set_irq: coalescing 0
>>
>> It looks like some other IRQs cause the coalescing, because also
>> looking at RTC code, it seems it's not possible for RTC to raise the
>> IRQ (except update IRQ, alarm etc.) without calling
>> apic_reset_irq_delivered().
>>
>> I've attached my test program. Compile:
>> gcc -m32 -o coalescing coalescing.S -ffreestanding -nostdlib -Wl,-T
>> coalescing.ld -g && objcopy -Obinary coalescing coalescing.bin
>>
>> Run:
>> qemu -L . -bios coalescing.bin -no-hpet -rtc-td-hack
>>
> The application does not work for me. Looks like it fails to enter
> protected mode. $pc jumps from 0x00000000fffffff0 to 0x00000000000f003e
> and back.

Strange. Here's a working binary.

>
>> >>
>> >> By default the APIC's delivery method for IRQs is ExtInt and
>> >> coalescing counting happens only with Fixed. This means that the guest
>> >> needs to reprogram APIC. It also looks like RTC interrupts need to be
>> >> triggered. But I didn't see both of these to happen simultaneously in
>> >> my tests with Linux and Windows guests. Of course, -rtc-td-hack flag
>> >> must be used and I also disabled HPET to be sure that RTC would be
>> >> used.
>> >>
>> >> With DEBUG_COALESCING enabled, I just get increasing numbers for
>> >> apic_irq_delivered:
>> >> apic: apic_set_irq: coalescing 67123
>> >> apic: apic_set_irq: coalescing 67124
>> >> apic: apic_set_irq: coalescing 67125
>> > So have you actually used -rtc-td-hack option? I compiled head of
>> > qemu.git with DEBUG_COALESCING and run WindowsXP guest with -rtc-td-hack
>> > and I get:
>> > apic: apic_reset_irq_delivered: old coalescing 3
>> > apic: apic_set_irq: coalescing 1
>> > apic: apic_get_irq_delivered: returning coalescing 1
>> > apic: apic_set_irq: coalescing 2
>> > apic: apic_set_irq: coalescing 3
>> > apic: apic_set_irq: coalescing 4
>> > apic: apic_set_irq: coalescing 5
>> > apic: apic_set_irq: coalescing 6
>> > apic: apic_reset_irq_delivered: old coalescing 6
>> > apic: apic_set_irq: coalescing 1
>> > apic: apic_get_irq_delivered: returning coalescing 1
>> >
>> >>
>> >> If the hack were active, the numbers would be close to zero (or at
>> >> least some point) because apic_reset_irq_delivered would be called,
>> >> but this does not happen. Could you specify a clear test case with
>> >> which the coalescing action could be tested? Linux or BSD based,
>> >> please.
>> > Linux don't use RTC as time source and I don't know about BSD, so no
>> > Linux or BSD test case for you, sorry. Run WindowXP standard HAL and put
>> > heavy load on the host. You can run video inside the gust to trigger
>> > coalescing more easily.
>>
>> I don't have Windows XP, sorry.
>>
> Will be hard to debug Windows time drift without Windows ;) Do you know
> what time source BSD uses?

No, at least OpenBSD 4.4 doesn't seem to use RTC.
Blue Swirl - May 30, 2010, 12:58 p.m.
On Sun, May 30, 2010 at 12:24 PM, Jan Kiszka <jan.kiszka@web.de> wrote:
> Blue Swirl wrote:
>>> Linux don't use RTC as time source and I don't know about BSD, so no
>>> Linux or BSD test case for you, sorry. Run WindowXP standard HAL and put
>>> heavy load on the host. You can run video inside the gust to trigger
>>> coalescing more easily.
>>
>> I don't have Windows XP, sorry.
>
> ReactOS [1], at least its 32-bit version, appears to use the RTC as well.

I tried LiveCD and QEMU versions, both seem to hang at boot. Is that expected?
Blue Swirl - May 30, 2010, 1:22 p.m.
2010/5/30 Gleb Natapov <gleb@redhat.com>:
> On Sun, May 30, 2010 at 12:10:16PM +0000, Blue Swirl wrote:
>> >> >> You missed the key word 'stopped'. If the timer is really stopped, no
>> >> >> IRQs should ever come out afterwards, just like on real HW. For the
>> >> >> emulation, this means loss of ticks which should have been delivered
>> >> >> before the change.
>> >> >>
>> >> > I haven't missed it. I describe to you reality of the situation. You want
>> >> > to change reality to be more close to what you want it to be by adding
>> >> > words to my description.
>> >>
>> >> Quoting Jan: 'So what to do with the backlog when the timer is
>> >> stopped?' I didn't add any words to your description, please be more
>> >> careful with your attributions. Why do you think I want to change the
>> >> reality?
>> > Please refer to my words when you answer to my quote. You quoted my
>> > answer to you statement:
>> >  Gleb only mentioned the frequency change, I thought that was not so big
>> >  problem. But I don't think this case should be allowed happen at all,
>> >  it can't exist on real HW.
>>
>> With 'this case' I was referring to 'case with timer stopped', not
>> 'case which Gleb mentioned'.
>>
>> > No 'stopped' was under discussion nowhere.
>>
>> It's clearly written there in the sentence Jan wrote.
>>
> Jan, not me, but lets leave this topic alone since you agree that
> stopped is just a case of frequency change anyway.
>
>> > FWIW 'stopped' is just a case
>> > of frequency change.
>>
>> True.
>>
>> >
>> >>
>> >> XP frequency change isn't the same case as timer being stopped.
>> >>
>> > And what is the big difference exactly?
>>
>> Because after the timer is stopped, its extremely unrealistic to send
>> any IRQs. Whereas if the frequency is changed to some other nonzero
>> value, we can cheat and inject some more queued IRQs.
>>
> Correct, when gets disables clock source (by reset or any other means)
> coalesced backlog should be forgotten.
>
>> Anyway, if this case is not interesting because it doesn't happen in
>> real life emulation scenarios, we can forget it no matter how buggy
>> the current QEMU implementation is.
>>
>> >> > Please just go write code, experiment, debug
>> >> > and _then_ come here with design.
>> >>
>> >> I added some debugging to RTC, PIC and APIC. I also built a small
>> >> guest in x86 assembly to test the coalescing. However, in the tests
>> >> with this guest and others I noticed that the coalescing only happens
>> >> in some obscure conditions.
>> > So try with real guest and with real load.
>>
>> Well, I'd like to get the test program also trigger it. Now I'm getting:
>> apic: write: 00000350 = 00000000
>> apic: apic_reset_irq_delivered: old coalescing 0
>> apic: apic_local_deliver: vector 3 delivery mode 0
>> apic: apic_set_irq: coalescing 1
>> apic: apic_get_irq_delivered: returning coalescing 1
>> apic: apic_reset_irq_delivered: old coalescing 1
>> apic: apic_local_deliver: vector 3 delivery mode 0
>> apic: apic_set_irq: coalescing 0
>> apic: apic_get_irq_delivered: returning coalescing 0
>> apic: apic_reset_irq_delivered: old coalescing 0
>> apic: apic_local_deliver: vector 3 delivery mode 0
>> apic: apic_set_irq: coalescing 0
>>
>> It looks like some other IRQs cause the coalescing, because also
>> looking at RTC code, it seems it's not possible for RTC to raise the
>> IRQ (except update IRQ, alarm etc.) without calling
>> apic_reset_irq_delivered().
>>
>> I've attached my test program. Compile:
>> gcc -m32 -o coalescing coalescing.S -ffreestanding -nostdlib -Wl,-T
>> coalescing.ld -g && objcopy -Obinary coalescing coalescing.bin
>>
>> Run:
>> qemu -L . -bios coalescing.bin -no-hpet -rtc-td-hack
>>
> The application does not work for me. Looks like it fails to enter
> protected mode. $pc jumps from 0x00000000fffffff0 to 0x00000000000f003e
> and back.
>
>> >>
>> >> By default the APIC's delivery method for IRQs is ExtInt and
>> >> coalescing counting happens only with Fixed. This means that the guest
>> >> needs to reprogram APIC. It also looks like RTC interrupts need to be
>> >> triggered. But I didn't see both of these to happen simultaneously in
>> >> my tests with Linux and Windows guests. Of course, -rtc-td-hack flag
>> >> must be used and I also disabled HPET to be sure that RTC would be
>> >> used.
>> >>
>> >> With DEBUG_COALESCING enabled, I just get increasing numbers for
>> >> apic_irq_delivered:
>> >> apic: apic_set_irq: coalescing 67123
>> >> apic: apic_set_irq: coalescing 67124
>> >> apic: apic_set_irq: coalescing 67125
>> > So have you actually used -rtc-td-hack option? I compiled head of
>> > qemu.git with DEBUG_COALESCING and run WindowsXP guest with -rtc-td-hack
>> > and I get:
>> > apic: apic_reset_irq_delivered: old coalescing 3
>> > apic: apic_set_irq: coalescing 1
>> > apic: apic_get_irq_delivered: returning coalescing 1
>> > apic: apic_set_irq: coalescing 2
>> > apic: apic_set_irq: coalescing 3
>> > apic: apic_set_irq: coalescing 4
>> > apic: apic_set_irq: coalescing 5
>> > apic: apic_set_irq: coalescing 6
>> > apic: apic_reset_irq_delivered: old coalescing 6
>> > apic: apic_set_irq: coalescing 1
>> > apic: apic_get_irq_delivered: returning coalescing 1
>> >
>> >>
>> >> If the hack were active, the numbers would be close to zero (or at
>> >> least some point) because apic_reset_irq_delivered would be called,
>> >> but this does not happen. Could you specify a clear test case with
>> >> which the coalescing action could be tested? Linux or BSD based,
>> >> please.
>> > Linux don't use RTC as time source and I don't know about BSD, so no
>> > Linux or BSD test case for you, sorry. Run WindowXP standard HAL and put
>> > heavy load on the host. You can run video inside the gust to trigger
>> > coalescing more easily.
>>
>> I don't have Windows XP, sorry.
>>
> Will be hard to debug Windows time drift without Windows ;) Do you know
> what time source BSD uses?

Seems to be PIT. OpenBSD:
http://www.openbsd.org/cgi-bin/cvsweb/src/sys/arch/i386/isa/clock.c?rev=1.42

FreeBSD:
http://fxr.watson.org/fxr/source/x86/isa/clock.c?im=bigexcerpts

Didn't find NetBSD clock.
Gleb Natapov - May 30, 2010, 1:49 p.m.
On Sun, May 30, 2010 at 12:56:26PM +0000, Blue Swirl wrote:
> >> Well, I'd like to get the test program also trigger it. Now I'm getting:
> >> apic: write: 00000350 = 00000000
> >> apic: apic_reset_irq_delivered: old coalescing 0
> >> apic: apic_local_deliver: vector 3 delivery mode 0
> >> apic: apic_set_irq: coalescing 1
> >> apic: apic_get_irq_delivered: returning coalescing 1
> >> apic: apic_reset_irq_delivered: old coalescing 1
> >> apic: apic_local_deliver: vector 3 delivery mode 0
> >> apic: apic_set_irq: coalescing 0
> >> apic: apic_get_irq_delivered: returning coalescing 0
> >> apic: apic_reset_irq_delivered: old coalescing 0
> >> apic: apic_local_deliver: vector 3 delivery mode 0
> >> apic: apic_set_irq: coalescing 0
> >>
So interrupt is _alway_ coalesced. If apic_get_irq_delivered() returns
0 it means the interrupt was not delivered.

> >> It looks like some other IRQs cause the coalescing, because also
> >> looking at RTC code, it seems it's not possible for RTC to raise the
> >> IRQ (except update IRQ, alarm etc.) without calling
> >> apic_reset_irq_delivered().
> >>
> >> I've attached my test program. Compile:
> >> gcc -m32 -o coalescing coalescing.S -ffreestanding -nostdlib -Wl,-T
> >> coalescing.ld -g && objcopy -Obinary coalescing coalescing.bin
> >>
> >> Run:
> >> qemu -L . -bios coalescing.bin -no-hpet -rtc-td-hack
> >>
> > The application does not work for me. Looks like it fails to enter
> > protected mode. $pc jumps from 0x00000000fffffff0 to 0x00000000000f003e
> > and back.
> 
> Strange. Here's a working binary.
> 
Your binary works here too. What compiler are you using?

--
			Gleb.
Blue Swirl - May 30, 2010, 4:54 p.m.
On Sun, May 30, 2010 at 1:49 PM, Gleb Natapov <gleb@redhat.com> wrote:
> On Sun, May 30, 2010 at 12:56:26PM +0000, Blue Swirl wrote:
>> >> Well, I'd like to get the test program also trigger it. Now I'm getting:
>> >> apic: write: 00000350 = 00000000
>> >> apic: apic_reset_irq_delivered: old coalescing 0
>> >> apic: apic_local_deliver: vector 3 delivery mode 0
>> >> apic: apic_set_irq: coalescing 1
>> >> apic: apic_get_irq_delivered: returning coalescing 1
>> >> apic: apic_reset_irq_delivered: old coalescing 1
>> >> apic: apic_local_deliver: vector 3 delivery mode 0
>> >> apic: apic_set_irq: coalescing 0
>> >> apic: apic_get_irq_delivered: returning coalescing 0
>> >> apic: apic_reset_irq_delivered: old coalescing 0
>> >> apic: apic_local_deliver: vector 3 delivery mode 0
>> >> apic: apic_set_irq: coalescing 0
>> >>
> So interrupt is _alway_ coalesced. If apic_get_irq_delivered() returns
> 0 it means the interrupt was not delivered.
>
>> >> It looks like some other IRQs cause the coalescing, because also
>> >> looking at RTC code, it seems it's not possible for RTC to raise the
>> >> IRQ (except update IRQ, alarm etc.) without calling
>> >> apic_reset_irq_delivered().
>> >>
>> >> I've attached my test program. Compile:
>> >> gcc -m32 -o coalescing coalescing.S -ffreestanding -nostdlib -Wl,-T
>> >> coalescing.ld -g && objcopy -Obinary coalescing coalescing.bin
>> >>
>> >> Run:
>> >> qemu -L . -bios coalescing.bin -no-hpet -rtc-td-hack
>> >>
>> > The application does not work for me. Looks like it fails to enter
>> > protected mode. $pc jumps from 0x00000000fffffff0 to 0x00000000000f003e
>> > and back.
>>
>> Strange. Here's a working binary.
>>
> Your binary works here too. What compiler are you using?

Using built-in specs.
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Debian
4.3.2-1.1' --with-bugurl=file:///usr/share/doc/gcc-4.3/README.Bugs
--enable-languages=c,c++,fortran,objc,obj-c++ --prefix=/usr
--enable-shared --with-system-zlib --libexecdir=/usr/lib
--without-included-gettext --enable-threads=posix --enable-nls
--with-gxx-include-dir=/usr/include/c++/4.3 --program-suffix=-4.3
--enable-clocale=gnu --enable-libstdcxx-debug --enable-objc-gc
--enable-mpfr --enable-cld --enable-checking=release
--build=x86_64-linux-gnu --host=x86_64-linux-gnu
--target=x86_64-linux-gnu
Thread model: posix
gcc version 4.3.2 (Debian 4.3.2-1.1)
Blue Swirl - May 30, 2010, 7:37 p.m.
On Sun, May 30, 2010 at 1:49 PM, Gleb Natapov <gleb@redhat.com> wrote:
> On Sun, May 30, 2010 at 12:56:26PM +0000, Blue Swirl wrote:
>> >> Well, I'd like to get the test program also trigger it. Now I'm getting:
>> >> apic: write: 00000350 = 00000000
>> >> apic: apic_reset_irq_delivered: old coalescing 0
>> >> apic: apic_local_deliver: vector 3 delivery mode 0
>> >> apic: apic_set_irq: coalescing 1
>> >> apic: apic_get_irq_delivered: returning coalescing 1
>> >> apic: apic_reset_irq_delivered: old coalescing 1
>> >> apic: apic_local_deliver: vector 3 delivery mode 0
>> >> apic: apic_set_irq: coalescing 0
>> >> apic: apic_get_irq_delivered: returning coalescing 0
>> >> apic: apic_reset_irq_delivered: old coalescing 0
>> >> apic: apic_local_deliver: vector 3 delivery mode 0
>> >> apic: apic_set_irq: coalescing 0
>> >>
> So interrupt is _alway_ coalesced. If apic_get_irq_delivered() returns
> 0 it means the interrupt was not delivered.

That seems strange. I changed the program so that the handler gets
executed, also output a dot to serial from the handler. I changed the
frequency to 2Hz.

Now, if I leave out -rtc-td-hack, I get the dots at 2Hz as expected.
With -rtc-td-hack, the dots come out much faster. I added
DEBUG_COALESCING also to RTC, with that enabled I get:
qemu -L . -bios coalescing.bin -d in_asm,int -no-hpet -rtc-td-hack -serial stdio
cmos: coalesced irqs scaled to 0
cmos: coalesced irqs increased to 1
cmos: injecting on ack
.cmos: injecting on ack
.cmos: injecting on ack
.cmos: injecting on ack
.cmos: injecting on ack
.cmos: injecting on ack
.cmos: injecting on ack
.cmos: injecting on ack
.cmos: injecting on ack
.cmos: injecting on ack
.cmos: injecting on ack
.cmos: injecting on ack
.cmos: injecting on ack
.cmos: injecting on ack
.cmos: injecting on ack
.cmos: injecting on ack
.cmos: injecting on ack
.cmos: injecting on ack
.cmos: injecting on ack
.cmos: injecting on ack
..cmos: injecting from timer
.cmos: coalesced irqs increased to 2
cmos: injecting on ack

So, there are bogus injections.
Gleb Natapov - May 30, 2010, 8:07 p.m.
On Sun, May 30, 2010 at 07:37:59PM +0000, Blue Swirl wrote:
> On Sun, May 30, 2010 at 1:49 PM, Gleb Natapov <gleb@redhat.com> wrote:
> > On Sun, May 30, 2010 at 12:56:26PM +0000, Blue Swirl wrote:
> >> >> Well, I'd like to get the test program also trigger it. Now I'm getting:
> >> >> apic: write: 00000350 = 00000000
> >> >> apic: apic_reset_irq_delivered: old coalescing 0
> >> >> apic: apic_local_deliver: vector 3 delivery mode 0
> >> >> apic: apic_set_irq: coalescing 1
> >> >> apic: apic_get_irq_delivered: returning coalescing 1
> >> >> apic: apic_reset_irq_delivered: old coalescing 1
> >> >> apic: apic_local_deliver: vector 3 delivery mode 0
> >> >> apic: apic_set_irq: coalescing 0
> >> >> apic: apic_get_irq_delivered: returning coalescing 0
> >> >> apic: apic_reset_irq_delivered: old coalescing 0
> >> >> apic: apic_local_deliver: vector 3 delivery mode 0
> >> >> apic: apic_set_irq: coalescing 0
> >> >>
> > So interrupt is _alway_ coalesced. If apic_get_irq_delivered() returns
> > 0 it means the interrupt was not delivered.
> 
> That seems strange. I changed the program so that the handler gets
> executed, also output a dot to serial from the handler. I changed the
> frequency to 2Hz.
> 
> Now, if I leave out -rtc-td-hack, I get the dots at 2Hz as expected.
> With -rtc-td-hack, the dots come out much faster. I added
> DEBUG_COALESCING also to RTC, with that enabled I get:
> qemu -L . -bios coalescing.bin -d in_asm,int -no-hpet -rtc-td-hack -serial stdio
> cmos: coalesced irqs scaled to 0
> cmos: coalesced irqs increased to 1
> cmos: injecting on ack
> .cmos: injecting on ack
> .cmos: injecting on ack
> .cmos: injecting on ack
> .cmos: injecting on ack
> .cmos: injecting on ack
> .cmos: injecting on ack
> .cmos: injecting on ack
> .cmos: injecting on ack
> .cmos: injecting on ack
> .cmos: injecting on ack
> .cmos: injecting on ack
> .cmos: injecting on ack
> .cmos: injecting on ack
> .cmos: injecting on ack
> .cmos: injecting on ack
> .cmos: injecting on ack
> .cmos: injecting on ack
> .cmos: injecting on ack
> .cmos: injecting on ack
> ..cmos: injecting from timer
> .cmos: coalesced irqs increased to 2
> cmos: injecting on ack
> 
> So, there are bogus injections.

Looks like irr in apic is never cleared. Probably bug in userspace apic
emulation. I'll look into it. Try to route interrupt via APIC (not ExtInt),
or use qemu-kvm with in kernel irq chip.

--
			Gleb.
Blue Swirl - May 30, 2010, 8:21 p.m.
On Sun, May 30, 2010 at 8:07 PM, Gleb Natapov <gleb@redhat.com> wrote:
> On Sun, May 30, 2010 at 07:37:59PM +0000, Blue Swirl wrote:
>> On Sun, May 30, 2010 at 1:49 PM, Gleb Natapov <gleb@redhat.com> wrote:
>> > On Sun, May 30, 2010 at 12:56:26PM +0000, Blue Swirl wrote:
>> >> >> Well, I'd like to get the test program also trigger it. Now I'm getting:
>> >> >> apic: write: 00000350 = 00000000
>> >> >> apic: apic_reset_irq_delivered: old coalescing 0
>> >> >> apic: apic_local_deliver: vector 3 delivery mode 0
>> >> >> apic: apic_set_irq: coalescing 1
>> >> >> apic: apic_get_irq_delivered: returning coalescing 1
>> >> >> apic: apic_reset_irq_delivered: old coalescing 1
>> >> >> apic: apic_local_deliver: vector 3 delivery mode 0
>> >> >> apic: apic_set_irq: coalescing 0
>> >> >> apic: apic_get_irq_delivered: returning coalescing 0
>> >> >> apic: apic_reset_irq_delivered: old coalescing 0
>> >> >> apic: apic_local_deliver: vector 3 delivery mode 0
>> >> >> apic: apic_set_irq: coalescing 0
>> >> >>
>> > So interrupt is _alway_ coalesced. If apic_get_irq_delivered() returns
>> > 0 it means the interrupt was not delivered.
>>
>> That seems strange. I changed the program so that the handler gets
>> executed, also output a dot to serial from the handler. I changed the
>> frequency to 2Hz.
>>
>> Now, if I leave out -rtc-td-hack, I get the dots at 2Hz as expected.
>> With -rtc-td-hack, the dots come out much faster. I added
>> DEBUG_COALESCING also to RTC, with that enabled I get:
>> qemu -L . -bios coalescing.bin -d in_asm,int -no-hpet -rtc-td-hack -serial stdio
>> cmos: coalesced irqs scaled to 0
>> cmos: coalesced irqs increased to 1
>> cmos: injecting on ack
>> .cmos: injecting on ack
>> .cmos: injecting on ack
>> .cmos: injecting on ack
>> .cmos: injecting on ack
>> .cmos: injecting on ack
>> .cmos: injecting on ack
>> .cmos: injecting on ack
>> .cmos: injecting on ack
>> .cmos: injecting on ack
>> .cmos: injecting on ack
>> .cmos: injecting on ack
>> .cmos: injecting on ack
>> .cmos: injecting on ack
>> .cmos: injecting on ack
>> .cmos: injecting on ack
>> .cmos: injecting on ack
>> .cmos: injecting on ack
>> .cmos: injecting on ack
>> .cmos: injecting on ack
>> ..cmos: injecting from timer
>> .cmos: coalesced irqs increased to 2
>> cmos: injecting on ack
>>
>> So, there are bogus injections.
>
> Looks like irr in apic is never cleared. Probably bug in userspace apic
> emulation. I'll look into it. Try to route interrupt via APIC (not ExtInt),
> or use qemu-kvm with in kernel irq chip.

With APIC you mean Fixed? Then the IRQ is not delivered at all.
Gleb Natapov - May 31, 2010, 5:19 a.m.
On Sun, May 30, 2010 at 08:21:30PM +0000, Blue Swirl wrote:
> On Sun, May 30, 2010 at 8:07 PM, Gleb Natapov <gleb@redhat.com> wrote:
> > On Sun, May 30, 2010 at 07:37:59PM +0000, Blue Swirl wrote:
> >> On Sun, May 30, 2010 at 1:49 PM, Gleb Natapov <gleb@redhat.com> wrote:
> >> > On Sun, May 30, 2010 at 12:56:26PM +0000, Blue Swirl wrote:
> >> >> >> Well, I'd like to get the test program also trigger it. Now I'm getting:
> >> >> >> apic: write: 00000350 = 00000000
> >> >> >> apic: apic_reset_irq_delivered: old coalescing 0
> >> >> >> apic: apic_local_deliver: vector 3 delivery mode 0
> >> >> >> apic: apic_set_irq: coalescing 1
> >> >> >> apic: apic_get_irq_delivered: returning coalescing 1
> >> >> >> apic: apic_reset_irq_delivered: old coalescing 1
> >> >> >> apic: apic_local_deliver: vector 3 delivery mode 0
> >> >> >> apic: apic_set_irq: coalescing 0
> >> >> >> apic: apic_get_irq_delivered: returning coalescing 0
> >> >> >> apic: apic_reset_irq_delivered: old coalescing 0
> >> >> >> apic: apic_local_deliver: vector 3 delivery mode 0
> >> >> >> apic: apic_set_irq: coalescing 0
> >> >> >>
> >> > So interrupt is _alway_ coalesced. If apic_get_irq_delivered() returns
> >> > 0 it means the interrupt was not delivered.
> >>
> >> That seems strange. I changed the program so that the handler gets
> >> executed, also output a dot to serial from the handler. I changed the
> >> frequency to 2Hz.
> >>
> >> Now, if I leave out -rtc-td-hack, I get the dots at 2Hz as expected.
> >> With -rtc-td-hack, the dots come out much faster. I added
> >> DEBUG_COALESCING also to RTC, with that enabled I get:
> >> qemu -L . -bios coalescing.bin -d in_asm,int -no-hpet -rtc-td-hack -serial stdio
> >> cmos: coalesced irqs scaled to 0
> >> cmos: coalesced irqs increased to 1
> >> cmos: injecting on ack
> >> .cmos: injecting on ack
> >> .cmos: injecting on ack
> >> .cmos: injecting on ack
> >> .cmos: injecting on ack
> >> .cmos: injecting on ack
> >> .cmos: injecting on ack
> >> .cmos: injecting on ack
> >> .cmos: injecting on ack
> >> .cmos: injecting on ack
> >> .cmos: injecting on ack
> >> .cmos: injecting on ack
> >> .cmos: injecting on ack
> >> .cmos: injecting on ack
> >> .cmos: injecting on ack
> >> .cmos: injecting on ack
> >> .cmos: injecting on ack
> >> .cmos: injecting on ack
> >> .cmos: injecting on ack
> >> .cmos: injecting on ack
> >> ..cmos: injecting from timer
> >> .cmos: coalesced irqs increased to 2
> >> cmos: injecting on ack
> >>
> >> So, there are bogus injections.
> >
> > Looks like irr in apic is never cleared. Probably bug in userspace apic
> > emulation. I'll look into it. Try to route interrupt via APIC (not ExtInt),
> > or use qemu-kvm with in kernel irq chip.
> 
> With APIC you mean Fixed? Then the IRQ is not delivered at all.
You need to deliver it through IOAPIC.

--
			Gleb.
Jan Kiszka - May 31, 2010, 7:46 a.m.
Blue Swirl wrote:
> On Sun, May 30, 2010 at 12:24 PM, Jan Kiszka <jan.kiszka@web.de> wrote:
>> Blue Swirl wrote:
>>>> Linux don't use RTC as time source and I don't know about BSD, so no
>>>> Linux or BSD test case for you, sorry. Run WindowXP standard HAL and put
>>>> heavy load on the host. You can run video inside the gust to trigger
>>>> coalescing more easily.
>>> I don't have Windows XP, sorry.
>> ReactOS [1], at least its 32-bit version, appears to use the RTC as well.
> 
> I tried LiveCD and QEMU versions, both seem to hang at boot. Is that expected?

Live-CD works for me (in contrast to the pre-installed image in KVM
mode), but it doesn't show the desired RTC usage.

Jan
Blue Swirl - June 1, 2010, 6 p.m.
On Mon, May 31, 2010 at 5:19 AM, Gleb Natapov <gleb@redhat.com> wrote:
> On Sun, May 30, 2010 at 08:21:30PM +0000, Blue Swirl wrote:
>> On Sun, May 30, 2010 at 8:07 PM, Gleb Natapov <gleb@redhat.com> wrote:
>> > On Sun, May 30, 2010 at 07:37:59PM +0000, Blue Swirl wrote:
>> >> On Sun, May 30, 2010 at 1:49 PM, Gleb Natapov <gleb@redhat.com> wrote:
>> >> > On Sun, May 30, 2010 at 12:56:26PM +0000, Blue Swirl wrote:
>> >> >> >> Well, I'd like to get the test program also trigger it. Now I'm getting:
>> >> >> >> apic: write: 00000350 = 00000000
>> >> >> >> apic: apic_reset_irq_delivered: old coalescing 0
>> >> >> >> apic: apic_local_deliver: vector 3 delivery mode 0
>> >> >> >> apic: apic_set_irq: coalescing 1
>> >> >> >> apic: apic_get_irq_delivered: returning coalescing 1
>> >> >> >> apic: apic_reset_irq_delivered: old coalescing 1
>> >> >> >> apic: apic_local_deliver: vector 3 delivery mode 0
>> >> >> >> apic: apic_set_irq: coalescing 0
>> >> >> >> apic: apic_get_irq_delivered: returning coalescing 0
>> >> >> >> apic: apic_reset_irq_delivered: old coalescing 0
>> >> >> >> apic: apic_local_deliver: vector 3 delivery mode 0
>> >> >> >> apic: apic_set_irq: coalescing 0
>> >> >> >>
>> >> > So interrupt is _alway_ coalesced. If apic_get_irq_delivered() returns
>> >> > 0 it means the interrupt was not delivered.
>> >>
>> >> That seems strange. I changed the program so that the handler gets
>> >> executed, also output a dot to serial from the handler. I changed the
>> >> frequency to 2Hz.
>> >>
>> >> Now, if I leave out -rtc-td-hack, I get the dots at 2Hz as expected.
>> >> With -rtc-td-hack, the dots come out much faster. I added
>> >> DEBUG_COALESCING also to RTC, with that enabled I get:
>> >> qemu -L . -bios coalescing.bin -d in_asm,int -no-hpet -rtc-td-hack -serial stdio
>> >> cmos: coalesced irqs scaled to 0
>> >> cmos: coalesced irqs increased to 1
>> >> cmos: injecting on ack
>> >> .cmos: injecting on ack
>> >> .cmos: injecting on ack
>> >> .cmos: injecting on ack
>> >> .cmos: injecting on ack
>> >> .cmos: injecting on ack
>> >> .cmos: injecting on ack
>> >> .cmos: injecting on ack
>> >> .cmos: injecting on ack
>> >> .cmos: injecting on ack
>> >> .cmos: injecting on ack
>> >> .cmos: injecting on ack
>> >> .cmos: injecting on ack
>> >> .cmos: injecting on ack
>> >> .cmos: injecting on ack
>> >> .cmos: injecting on ack
>> >> .cmos: injecting on ack
>> >> .cmos: injecting on ack
>> >> .cmos: injecting on ack
>> >> .cmos: injecting on ack
>> >> ..cmos: injecting from timer
>> >> .cmos: coalesced irqs increased to 2
>> >> cmos: injecting on ack
>> >>
>> >> So, there are bogus injections.
>> >
>> > Looks like irr in apic is never cleared. Probably bug in userspace apic
>> > emulation. I'll look into it. Try to route interrupt via APIC (not ExtInt),
>> > or use qemu-kvm with in kernel irq chip.
>>
>> With APIC you mean Fixed? Then the IRQ is not delivered at all.
> You need to deliver it through IOAPIC.

In this version, when USE_APIC is defined, IRQs are routed via IOAPIC
and APIC, PIC interrupts are disabled. Otherwise, IOAPIC and APIC are
left to reset state and PIC is used.

With the PIC version, there are bogus injections. With IOAPIC and APIC
dots come at 2Hz. -enable-kvm has no effect.
Gleb Natapov - June 1, 2010, 6:30 p.m.
On Tue, Jun 01, 2010 at 06:00:20PM +0000, Blue Swirl wrote:
> >> > Looks like irr in apic is never cleared. Probably bug in userspace apic
> >> > emulation. I'll look into it. Try to route interrupt via APIC (not ExtInt),
> >> > or use qemu-kvm with in kernel irq chip.
> >>
> >> With APIC you mean Fixed? Then the IRQ is not delivered at all.
> > You need to deliver it through IOAPIC.
> 
> In this version, when USE_APIC is defined, IRQs are routed via IOAPIC
> and APIC, PIC interrupts are disabled. Otherwise, IOAPIC and APIC are
> left to reset state and PIC is used.
> 
> With the PIC version, there are bogus injections. With IOAPIC and APIC
> dots come at 2Hz. -enable-kvm has no effect.

I looked into this briefly. Virtual wire case is not handled correctly,
and pic lacks coalescing detection at all (the reason BTW because proper
solution was rejected, so only hack required by Windows was added).
Windows routes RTC interrupts through IOAPIC and this case works as you
can see. -enable-kvm on qemu should not make any difference since it
uses the same userspace pic/ioapic/apic code as TCG. It would be
interesting to try with qemu-kvm which uses in-kernel irq chips by
default.

BTW I managed to compile you test program with gcc 4.2. 4.4 and 4.3
produced non-working binaries for me.

--
			Gleb.
Blue Swirl - June 2, 2010, 7:05 p.m.
On Tue, Jun 1, 2010 at 6:30 PM, Gleb Natapov <gleb@redhat.com> wrote:
> On Tue, Jun 01, 2010 at 06:00:20PM +0000, Blue Swirl wrote:
>> >> > Looks like irr in apic is never cleared. Probably bug in userspace apic
>> >> > emulation. I'll look into it. Try to route interrupt via APIC (not ExtInt),
>> >> > or use qemu-kvm with in kernel irq chip.
>> >>
>> >> With APIC you mean Fixed? Then the IRQ is not delivered at all.
>> > You need to deliver it through IOAPIC.
>>
>> In this version, when USE_APIC is defined, IRQs are routed via IOAPIC
>> and APIC, PIC interrupts are disabled. Otherwise, IOAPIC and APIC are
>> left to reset state and PIC is used.
>>
>> With the PIC version, there are bogus injections. With IOAPIC and APIC
>> dots come at 2Hz. -enable-kvm has no effect.
>
> I looked into this briefly. Virtual wire case is not handled correctly,
> and pic lacks coalescing detection at all (the reason BTW because proper
> solution was rejected, so only hack required by Windows was added).
> Windows routes RTC interrupts through IOAPIC and this case works as you
> can see. -enable-kvm on qemu should not make any difference since it
> uses the same userspace pic/ioapic/apic code as TCG. It would be
> interesting to try with qemu-kvm which uses in-kernel irq chips by
> default.
>
> BTW I managed to compile you test program with gcc 4.2. 4.4 and 4.3
> produced non-working binaries for me.

This exercise showed me that the IRQ connectivity between APIC and RTC
can vary a lot because of guest activities. So it may be impossible to
know at APIC if some IRQ came from RTC and so the coalescing must be
accounted at RTC (or close to RTC). This makes a generic solution less
useful, I think the net result would be to refactor
rtc_coalesced_timer*() to a new file. Also my generic approach of time
base adjustment would need to be steered by RTC because APIC can't. So
I think I'll have to abandon this idea for now. Sorry for the delay,
thanks for the interesting discussion though.

Coming back to $SUBJECT: I don't think the bidirectional IRQ is a good
solution either, but it may be slightly better than current hack (I
didn't realize the full ugliness of the hack before, namely that APIC
and RTC are really not connected at all). The same effect could be
achieved with two signals going to opposite directions, like
APICREQ/APICACK on 82093 IOAPIC but again, if any IRQs point to same
vector at IOAPIC, there is no way to get back to originator. With
bidirectional IRQs, there's a one-shot chance at the delivery time,
but not later at ACK/EOI time.

But how about if we introduced instead a message based IRQ? Then the
message could specify the originator device, maybe ACK/coalesce/NACK
callbacks and a bigger payload than just 1 bit of level. I think that
could achieve the same coalescing effect as what the bidirectional
IRQ. The payload could be useful for other purposes, for example
Sparc64 IRQ messages contain three 64 bit words.
Jan Kiszka - June 3, 2010, 6:23 a.m.
Blue Swirl wrote:
> But how about if we introduced instead a message based IRQ? Then the
> message could specify the originator device, maybe ACK/coalesce/NACK
> callbacks and a bigger payload than just 1 bit of level. I think that
> could achieve the same coalescing effect as what the bidirectional
> IRQ. The payload could be useful for other purposes, for example
> Sparc64 IRQ messages contain three 64 bit words.

If there are more users than just IRQ de-coalescing, this indeed sounds
superior. We could pass objects like this one around:

struct qemu_irq_msg {
	void (*delivery_cb)(int result);
	void *payload;
};

They would be valid within the scope of the IRQ handlers. Whoever
terminates or actually delivers the IRQ would invoke the callback. And
platforms like sparc64 could evaluate the additional payload pointer in
their irqchips or wherever they need it. IRQ routers on platforms that
make use of these messages would have to replicate them when forwarding
an event.

OK?

Jan
Gleb Natapov - June 3, 2010, 6:34 a.m.
On Thu, Jun 03, 2010 at 08:23:46AM +0200, Jan Kiszka wrote:
> Blue Swirl wrote:
> > But how about if we introduced instead a message based IRQ? Then the
> > message could specify the originator device, maybe ACK/coalesce/NACK
> > callbacks and a bigger payload than just 1 bit of level. I think that
> > could achieve the same coalescing effect as what the bidirectional
> > IRQ. The payload could be useful for other purposes, for example
> > Sparc64 IRQ messages contain three 64 bit words.
> 
> If there are more users than just IRQ de-coalescing, this indeed sounds
> superior. We could pass objects like this one around:
> 
> struct qemu_irq_msg {
> 	void (*delivery_cb)(int result);
> 	void *payload;
> };
> 
> They would be valid within the scope of the IRQ handlers. Whoever
> terminates or actually delivers the IRQ would invoke the callback. And
> platforms like sparc64 could evaluate the additional payload pointer in
> their irqchips or wherever they need it. IRQ routers on platforms that
> make use of these messages would have to replicate them when forwarding
> an event.
> 
> OK?
> 
Let me see if I understand you correctly. qemu_set_irq() will get
additional parameter qemu_irq_msg and if irq was not coalesced
delivery_cb is called, so there is a guaranty that if delivery_cb is
called it is done before qemu_set_irq() returns. Correct?

--
			Gleb.
Jan Kiszka - June 3, 2010, 6:59 a.m.
Gleb Natapov wrote:
> On Thu, Jun 03, 2010 at 08:23:46AM +0200, Jan Kiszka wrote:
>> Blue Swirl wrote:
>>> But how about if we introduced instead a message based IRQ? Then the
>>> message could specify the originator device, maybe ACK/coalesce/NACK
>>> callbacks and a bigger payload than just 1 bit of level. I think that
>>> could achieve the same coalescing effect as what the bidirectional
>>> IRQ. The payload could be useful for other purposes, for example
>>> Sparc64 IRQ messages contain three 64 bit words.
>> If there are more users than just IRQ de-coalescing, this indeed sounds
>> superior. We could pass objects like this one around:
>>
>> struct qemu_irq_msg {
>> 	void (*delivery_cb)(int result);
>> 	void *payload;
>> };
>>
>> They would be valid within the scope of the IRQ handlers. Whoever
>> terminates or actually delivers the IRQ would invoke the callback. And
>> platforms like sparc64 could evaluate the additional payload pointer in
>> their irqchips or wherever they need it. IRQ routers on platforms that
>> make use of these messages would have to replicate them when forwarding
>> an event.
>>
>> OK?
>>
> Let me see if I understand you correctly. qemu_set_irq() will get
> additional parameter qemu_irq_msg and if irq was not coalesced
> delivery_cb is called, so there is a guaranty that if delivery_cb is
> called it is done before qemu_set_irq() returns. Correct?

If the side that triggers an IRQ passes a message object with a non-NULL
callback, it is supposed to be called unconditionally, passing the
result of the delivery (delivered, masked, coalesced). And yes, the
callback will be invoked in the context of the irq handler, so before
qemu_set_irq (or rather some new qemu_set_irq_msg) returns.

Jan
Gleb Natapov - June 3, 2010, 7:03 a.m.
On Thu, Jun 03, 2010 at 08:59:23AM +0200, Jan Kiszka wrote:
> Gleb Natapov wrote:
> > On Thu, Jun 03, 2010 at 08:23:46AM +0200, Jan Kiszka wrote:
> >> Blue Swirl wrote:
> >>> But how about if we introduced instead a message based IRQ? Then the
> >>> message could specify the originator device, maybe ACK/coalesce/NACK
> >>> callbacks and a bigger payload than just 1 bit of level. I think that
> >>> could achieve the same coalescing effect as what the bidirectional
> >>> IRQ. The payload could be useful for other purposes, for example
> >>> Sparc64 IRQ messages contain three 64 bit words.
> >> If there are more users than just IRQ de-coalescing, this indeed sounds
> >> superior. We could pass objects like this one around:
> >>
> >> struct qemu_irq_msg {
> >> 	void (*delivery_cb)(int result);
> >> 	void *payload;
> >> };
> >>
> >> They would be valid within the scope of the IRQ handlers. Whoever
> >> terminates or actually delivers the IRQ would invoke the callback. And
> >> platforms like sparc64 could evaluate the additional payload pointer in
> >> their irqchips or wherever they need it. IRQ routers on platforms that
> >> make use of these messages would have to replicate them when forwarding
> >> an event.
> >>
> >> OK?
> >>
> > Let me see if I understand you correctly. qemu_set_irq() will get
> > additional parameter qemu_irq_msg and if irq was not coalesced
> > delivery_cb is called, so there is a guaranty that if delivery_cb is
> > called it is done before qemu_set_irq() returns. Correct?
> 
> If the side that triggers an IRQ passes a message object with a non-NULL
> callback, it is supposed to be called unconditionally, passing the
> result of the delivery (delivered, masked, coalesced). And yes, the
> callback will be invoked in the context of the irq handler, so before
> qemu_set_irq (or rather some new qemu_set_irq_msg) returns.
> 
Looks fine to me.

--
			Gleb.
Gleb Natapov - June 3, 2010, 7:06 a.m.
On Thu, Jun 03, 2010 at 10:03:00AM +0300, Gleb Natapov wrote:
> On Thu, Jun 03, 2010 at 08:59:23AM +0200, Jan Kiszka wrote:
> > Gleb Natapov wrote:
> > > On Thu, Jun 03, 2010 at 08:23:46AM +0200, Jan Kiszka wrote:
> > >> Blue Swirl wrote:
> > >>> But how about if we introduced instead a message based IRQ? Then the
> > >>> message could specify the originator device, maybe ACK/coalesce/NACK
> > >>> callbacks and a bigger payload than just 1 bit of level. I think that
> > >>> could achieve the same coalescing effect as what the bidirectional
> > >>> IRQ. The payload could be useful for other purposes, for example
> > >>> Sparc64 IRQ messages contain three 64 bit words.
> > >> If there are more users than just IRQ de-coalescing, this indeed sounds
> > >> superior. We could pass objects like this one around:
> > >>
> > >> struct qemu_irq_msg {
> > >> 	void (*delivery_cb)(int result);
> > >> 	void *payload;
> > >> };
> > >>
> > >> They would be valid within the scope of the IRQ handlers. Whoever
> > >> terminates or actually delivers the IRQ would invoke the callback. And
> > >> platforms like sparc64 could evaluate the additional payload pointer in
> > >> their irqchips or wherever they need it. IRQ routers on platforms that
> > >> make use of these messages would have to replicate them when forwarding
> > >> an event.
> > >>
> > >> OK?
> > >>
> > > Let me see if I understand you correctly. qemu_set_irq() will get
> > > additional parameter qemu_irq_msg and if irq was not coalesced
> > > delivery_cb is called, so there is a guaranty that if delivery_cb is
> > > called it is done before qemu_set_irq() returns. Correct?
> > 
> > If the side that triggers an IRQ passes a message object with a non-NULL
> > callback, it is supposed to be called unconditionally, passing the
> > result of the delivery (delivered, masked, coalesced). And yes, the
> > callback will be invoked in the context of the irq handler, so before
> > qemu_set_irq (or rather some new qemu_set_irq_msg) returns.
> > 
> Looks fine to me.
> 
Except that delivery_cb should probably get pointer to qemu_irq_msg as a
parameter.

--
			Gleb.
Blue Swirl - June 4, 2010, 7:05 p.m.
On Thu, Jun 3, 2010 at 7:06 AM, Gleb Natapov <gleb@redhat.com> wrote:
> On Thu, Jun 03, 2010 at 10:03:00AM +0300, Gleb Natapov wrote:
>> On Thu, Jun 03, 2010 at 08:59:23AM +0200, Jan Kiszka wrote:
>> > Gleb Natapov wrote:
>> > > On Thu, Jun 03, 2010 at 08:23:46AM +0200, Jan Kiszka wrote:
>> > >> Blue Swirl wrote:
>> > >>> But how about if we introduced instead a message based IRQ? Then the
>> > >>> message could specify the originator device, maybe ACK/coalesce/NACK
>> > >>> callbacks and a bigger payload than just 1 bit of level. I think that
>> > >>> could achieve the same coalescing effect as what the bidirectional
>> > >>> IRQ. The payload could be useful for other purposes, for example
>> > >>> Sparc64 IRQ messages contain three 64 bit words.
>> > >> If there are more users than just IRQ de-coalescing, this indeed sounds
>> > >> superior. We could pass objects like this one around:
>> > >>
>> > >> struct qemu_irq_msg {
>> > >>  void (*delivery_cb)(int result);
>> > >>  void *payload;
>> > >> };
>> > >>
>> > >> They would be valid within the scope of the IRQ handlers. Whoever
>> > >> terminates or actually delivers the IRQ would invoke the callback. And
>> > >> platforms like sparc64 could evaluate the additional payload pointer in
>> > >> their irqchips or wherever they need it. IRQ routers on platforms that
>> > >> make use of these messages would have to replicate them when forwarding
>> > >> an event.
>> > >>
>> > >> OK?
>> > >>
>> > > Let me see if I understand you correctly. qemu_set_irq() will get
>> > > additional parameter qemu_irq_msg and if irq was not coalesced
>> > > delivery_cb is called, so there is a guaranty that if delivery_cb is
>> > > called it is done before qemu_set_irq() returns. Correct?
>> >
>> > If the side that triggers an IRQ passes a message object with a non-NULL
>> > callback, it is supposed to be called unconditionally, passing the
>> > result of the delivery (delivered, masked, coalesced). And yes, the
>> > callback will be invoked in the context of the irq handler, so before
>> > qemu_set_irq (or rather some new qemu_set_irq_msg) returns.
>> >
>> Looks fine to me.
>>
> Except that delivery_cb should probably get pointer to qemu_irq_msg as a
> parameter.

I'd like to also support EOI handling. When the guest clears the
interrupt condtion, the EOI callback would be called. This could occur
much later than the IRQ delivery time. I'm not sure if we need the
result code in that case.

If any intermediate device (IOAPIC?) needs to be informed about either
delivery or EOI also, it could create a proxy message with its
callbacks in place. But we need then a separate opaque field (in
addition to payload) to store the original message.

struct IRQMsg {
 DeviceState *src;
 void (*delivery_cb)(IRQMsg *msg, int result);
 void (*eoi_cb)(IRQMsg *msg, int result);
 void *src_opaque;
 void *payload;
};
Jan Kiszka - June 5, 2010, 12:04 a.m.
Blue Swirl wrote:
> On Thu, Jun 3, 2010 at 7:06 AM, Gleb Natapov <gleb@redhat.com> wrote:
>> On Thu, Jun 03, 2010 at 10:03:00AM +0300, Gleb Natapov wrote:
>>> On Thu, Jun 03, 2010 at 08:59:23AM +0200, Jan Kiszka wrote:
>>>> Gleb Natapov wrote:
>>>>> On Thu, Jun 03, 2010 at 08:23:46AM +0200, Jan Kiszka wrote:
>>>>>> Blue Swirl wrote:
>>>>>>> But how about if we introduced instead a message based IRQ? Then the
>>>>>>> message could specify the originator device, maybe ACK/coalesce/NACK
>>>>>>> callbacks and a bigger payload than just 1 bit of level. I think that
>>>>>>> could achieve the same coalescing effect as what the bidirectional
>>>>>>> IRQ. The payload could be useful for other purposes, for example
>>>>>>> Sparc64 IRQ messages contain three 64 bit words.
>>>>>> If there are more users than just IRQ de-coalescing, this indeed sounds
>>>>>> superior. We could pass objects like this one around:
>>>>>>
>>>>>> struct qemu_irq_msg {
>>>>>>  void (*delivery_cb)(int result);
>>>>>>  void *payload;
>>>>>> };
>>>>>>
>>>>>> They would be valid within the scope of the IRQ handlers. Whoever
>>>>>> terminates or actually delivers the IRQ would invoke the callback. And
>>>>>> platforms like sparc64 could evaluate the additional payload pointer in
>>>>>> their irqchips or wherever they need it. IRQ routers on platforms that
>>>>>> make use of these messages would have to replicate them when forwarding
>>>>>> an event.
>>>>>>
>>>>>> OK?
>>>>>>
>>>>> Let me see if I understand you correctly. qemu_set_irq() will get
>>>>> additional parameter qemu_irq_msg and if irq was not coalesced
>>>>> delivery_cb is called, so there is a guaranty that if delivery_cb is
>>>>> called it is done before qemu_set_irq() returns. Correct?
>>>> If the side that triggers an IRQ passes a message object with a non-NULL
>>>> callback, it is supposed to be called unconditionally, passing the
>>>> result of the delivery (delivered, masked, coalesced). And yes, the
>>>> callback will be invoked in the context of the irq handler, so before
>>>> qemu_set_irq (or rather some new qemu_set_irq_msg) returns.
>>>>
>>> Looks fine to me.
>>>
>> Except that delivery_cb should probably get pointer to qemu_irq_msg as a
>> parameter.
> 
> I'd like to also support EOI handling. When the guest clears the
> interrupt condtion, the EOI callback would be called. This could occur
> much later than the IRQ delivery time. I'm not sure if we need the
> result code in that case.
> 
> If any intermediate device (IOAPIC?) needs to be informed about either
> delivery or EOI also, it could create a proxy message with its
> callbacks in place. But we need then a separate opaque field (in
> addition to payload) to store the original message.
> 
> struct IRQMsg {
>  DeviceState *src;
>  void (*delivery_cb)(IRQMsg *msg, int result);
>  void (*eoi_cb)(IRQMsg *msg, int result);
>  void *src_opaque;
>  void *payload;
> };

Extending the lifetime of IRQMsg objects beyond the delivery call stack
means qemu_malloc/free for every delivery. I think it takes a _very_
appealing reason to justify this. But so far I do not see any use case
for eio_cb at all.

Jan
Blue Swirl - June 5, 2010, 7:20 a.m.
On Sat, Jun 5, 2010 at 12:04 AM, Jan Kiszka <jan.kiszka@web.de> wrote:
> Blue Swirl wrote:
>> On Thu, Jun 3, 2010 at 7:06 AM, Gleb Natapov <gleb@redhat.com> wrote:
>>> On Thu, Jun 03, 2010 at 10:03:00AM +0300, Gleb Natapov wrote:
>>>> On Thu, Jun 03, 2010 at 08:59:23AM +0200, Jan Kiszka wrote:
>>>>> Gleb Natapov wrote:
>>>>>> On Thu, Jun 03, 2010 at 08:23:46AM +0200, Jan Kiszka wrote:
>>>>>>> Blue Swirl wrote:
>>>>>>>> But how about if we introduced instead a message based IRQ? Then the
>>>>>>>> message could specify the originator device, maybe ACK/coalesce/NACK
>>>>>>>> callbacks and a bigger payload than just 1 bit of level. I think that
>>>>>>>> could achieve the same coalescing effect as what the bidirectional
>>>>>>>> IRQ. The payload could be useful for other purposes, for example
>>>>>>>> Sparc64 IRQ messages contain three 64 bit words.
>>>>>>> If there are more users than just IRQ de-coalescing, this indeed sounds
>>>>>>> superior. We could pass objects like this one around:
>>>>>>>
>>>>>>> struct qemu_irq_msg {
>>>>>>>  void (*delivery_cb)(int result);
>>>>>>>  void *payload;
>>>>>>> };
>>>>>>>
>>>>>>> They would be valid within the scope of the IRQ handlers. Whoever
>>>>>>> terminates or actually delivers the IRQ would invoke the callback. And
>>>>>>> platforms like sparc64 could evaluate the additional payload pointer in
>>>>>>> their irqchips or wherever they need it. IRQ routers on platforms that
>>>>>>> make use of these messages would have to replicate them when forwarding
>>>>>>> an event.
>>>>>>>
>>>>>>> OK?
>>>>>>>
>>>>>> Let me see if I understand you correctly. qemu_set_irq() will get
>>>>>> additional parameter qemu_irq_msg and if irq was not coalesced
>>>>>> delivery_cb is called, so there is a guaranty that if delivery_cb is
>>>>>> called it is done before qemu_set_irq() returns. Correct?
>>>>> If the side that triggers an IRQ passes a message object with a non-NULL
>>>>> callback, it is supposed to be called unconditionally, passing the
>>>>> result of the delivery (delivered, masked, coalesced). And yes, the
>>>>> callback will be invoked in the context of the irq handler, so before
>>>>> qemu_set_irq (or rather some new qemu_set_irq_msg) returns.
>>>>>
>>>> Looks fine to me.
>>>>
>>> Except that delivery_cb should probably get pointer to qemu_irq_msg as a
>>> parameter.
>>
>> I'd like to also support EOI handling. When the guest clears the
>> interrupt condtion, the EOI callback would be called. This could occur
>> much later than the IRQ delivery time. I'm not sure if we need the
>> result code in that case.
>>
>> If any intermediate device (IOAPIC?) needs to be informed about either
>> delivery or EOI also, it could create a proxy message with its
>> callbacks in place. But we need then a separate opaque field (in
>> addition to payload) to store the original message.
>>
>> struct IRQMsg {
>>  DeviceState *src;
>>  void (*delivery_cb)(IRQMsg *msg, int result);
>>  void (*eoi_cb)(IRQMsg *msg, int result);
>>  void *src_opaque;
>>  void *payload;
>> };
>
> Extending the lifetime of IRQMsg objects beyond the delivery call stack
> means qemu_malloc/free for every delivery. I think it takes a _very_
> appealing reason to justify this. But so far I do not see any use case
> for eio_cb at all.

I think it's safer to use allocation model anyway because this will be
generic code. For example, an intermediate device may want to queue
the IRQs. Alternatively, the callbacks could use DeviceState and some
opaque which can be used as the callback context:
  void (*delivery_cb)(DeviceState *src, void *src_opaque, int result);

EOI can be added later if needed, QEMU seems to work fine now without
it. But based on IOAPIC data sheet, I'd suppose it should be need to
pass EOI from LAPIC to IOAPIC. It could be used by coalescing as
another opportunity to inject IRQs though I guess the guest will ack
the IRQ at the same time for both RTC and APIC.
Jan Kiszka - June 5, 2010, 8:27 a.m.
Blue Swirl wrote:
> On Sat, Jun 5, 2010 at 12:04 AM, Jan Kiszka <jan.kiszka@web.de> wrote:
>> Blue Swirl wrote:
>>> On Thu, Jun 3, 2010 at 7:06 AM, Gleb Natapov <gleb@redhat.com> wrote:
>>>> On Thu, Jun 03, 2010 at 10:03:00AM +0300, Gleb Natapov wrote:
>>>>> On Thu, Jun 03, 2010 at 08:59:23AM +0200, Jan Kiszka wrote:
>>>>>> Gleb Natapov wrote:
>>>>>>> On Thu, Jun 03, 2010 at 08:23:46AM +0200, Jan Kiszka wrote:
>>>>>>>> Blue Swirl wrote:
>>>>>>>>> But how about if we introduced instead a message based IRQ? Then the
>>>>>>>>> message could specify the originator device, maybe ACK/coalesce/NACK
>>>>>>>>> callbacks and a bigger payload than just 1 bit of level. I think that
>>>>>>>>> could achieve the same coalescing effect as what the bidirectional
>>>>>>>>> IRQ. The payload could be useful for other purposes, for example
>>>>>>>>> Sparc64 IRQ messages contain three 64 bit words.
>>>>>>>> If there are more users than just IRQ de-coalescing, this indeed sounds
>>>>>>>> superior. We could pass objects like this one around:
>>>>>>>>
>>>>>>>> struct qemu_irq_msg {
>>>>>>>>  void (*delivery_cb)(int result);
>>>>>>>>  void *payload;
>>>>>>>> };
>>>>>>>>
>>>>>>>> They would be valid within the scope of the IRQ handlers. Whoever
>>>>>>>> terminates or actually delivers the IRQ would invoke the callback. And
>>>>>>>> platforms like sparc64 could evaluate the additional payload pointer in
>>>>>>>> their irqchips or wherever they need it. IRQ routers on platforms that
>>>>>>>> make use of these messages would have to replicate them when forwarding
>>>>>>>> an event.
>>>>>>>>
>>>>>>>> OK?
>>>>>>>>
>>>>>>> Let me see if I understand you correctly. qemu_set_irq() will get
>>>>>>> additional parameter qemu_irq_msg and if irq was not coalesced
>>>>>>> delivery_cb is called, so there is a guaranty that if delivery_cb is
>>>>>>> called it is done before qemu_set_irq() returns. Correct?
>>>>>> If the side that triggers an IRQ passes a message object with a non-NULL
>>>>>> callback, it is supposed to be called unconditionally, passing the
>>>>>> result of the delivery (delivered, masked, coalesced). And yes, the
>>>>>> callback will be invoked in the context of the irq handler, so before
>>>>>> qemu_set_irq (or rather some new qemu_set_irq_msg) returns.
>>>>>>
>>>>> Looks fine to me.
>>>>>
>>>> Except that delivery_cb should probably get pointer to qemu_irq_msg as a
>>>> parameter.
>>> I'd like to also support EOI handling. When the guest clears the
>>> interrupt condtion, the EOI callback would be called. This could occur
>>> much later than the IRQ delivery time. I'm not sure if we need the
>>> result code in that case.
>>>
>>> If any intermediate device (IOAPIC?) needs to be informed about either
>>> delivery or EOI also, it could create a proxy message with its
>>> callbacks in place. But we need then a separate opaque field (in
>>> addition to payload) to store the original message.
>>>
>>> struct IRQMsg {
>>>  DeviceState *src;
>>>  void (*delivery_cb)(IRQMsg *msg, int result);
>>>  void (*eoi_cb)(IRQMsg *msg, int result);
>>>  void *src_opaque;
>>>  void *payload;
>>> };
>> Extending the lifetime of IRQMsg objects beyond the delivery call stack
>> means qemu_malloc/free for every delivery. I think it takes a _very_
>> appealing reason to justify this. But so far I do not see any use case
>> for eio_cb at all.
> 
> I think it's safer to use allocation model anyway because this will be
> generic code. For example, an intermediate device may want to queue
> the IRQs. Alternatively, the callbacks could use DeviceState and some
> opaque which can be used as the callback context:
>   void (*delivery_cb)(DeviceState *src, void *src_opaque, int result);
> 
> EOI can be added later if needed, QEMU seems to work fine now without
> it. But based on IOAPIC data sheet, I'd suppose it should be need to
> pass EOI from LAPIC to IOAPIC. It could be used by coalescing as
> another opportunity to inject IRQs though I guess the guest will ack
> the IRQ at the same time for both RTC and APIC.

Let's wait for a real use case for an extended IRQMsg lifetime. For now
we are fine with stack-allocated messages which are way simpler to
handle. I'm already drafting a first prototype based on this model.
Switching to dynamic allocation may still happen later on once the
urgent need shows up.

Jan
Blue Swirl - June 5, 2010, 9:23 a.m.
On Sat, Jun 5, 2010 at 8:27 AM, Jan Kiszka <jan.kiszka@web.de> wrote:
> Blue Swirl wrote:
>> On Sat, Jun 5, 2010 at 12:04 AM, Jan Kiszka <jan.kiszka@web.de> wrote:
>>> Blue Swirl wrote:
>>>> On Thu, Jun 3, 2010 at 7:06 AM, Gleb Natapov <gleb@redhat.com> wrote:
>>>>> On Thu, Jun 03, 2010 at 10:03:00AM +0300, Gleb Natapov wrote:
>>>>>> On Thu, Jun 03, 2010 at 08:59:23AM +0200, Jan Kiszka wrote:
>>>>>>> Gleb Natapov wrote:
>>>>>>>> On Thu, Jun 03, 2010 at 08:23:46AM +0200, Jan Kiszka wrote:
>>>>>>>>> Blue Swirl wrote:
>>>>>>>>>> But how about if we introduced instead a message based IRQ? Then the
>>>>>>>>>> message could specify the originator device, maybe ACK/coalesce/NACK
>>>>>>>>>> callbacks and a bigger payload than just 1 bit of level. I think that
>>>>>>>>>> could achieve the same coalescing effect as what the bidirectional
>>>>>>>>>> IRQ. The payload could be useful for other purposes, for example
>>>>>>>>>> Sparc64 IRQ messages contain three 64 bit words.
>>>>>>>>> If there are more users than just IRQ de-coalescing, this indeed sounds
>>>>>>>>> superior. We could pass objects like this one around:
>>>>>>>>>
>>>>>>>>> struct qemu_irq_msg {
>>>>>>>>>  void (*delivery_cb)(int result);
>>>>>>>>>  void *payload;
>>>>>>>>> };
>>>>>>>>>
>>>>>>>>> They would be valid within the scope of the IRQ handlers. Whoever
>>>>>>>>> terminates or actually delivers the IRQ would invoke the callback. And
>>>>>>>>> platforms like sparc64 could evaluate the additional payload pointer in
>>>>>>>>> their irqchips or wherever they need it. IRQ routers on platforms that
>>>>>>>>> make use of these messages would have to replicate them when forwarding
>>>>>>>>> an event.
>>>>>>>>>
>>>>>>>>> OK?
>>>>>>>>>
>>>>>>>> Let me see if I understand you correctly. qemu_set_irq() will get
>>>>>>>> additional parameter qemu_irq_msg and if irq was not coalesced
>>>>>>>> delivery_cb is called, so there is a guaranty that if delivery_cb is
>>>>>>>> called it is done before qemu_set_irq() returns. Correct?
>>>>>>> If the side that triggers an IRQ passes a message object with a non-NULL
>>>>>>> callback, it is supposed to be called unconditionally, passing the
>>>>>>> result of the delivery (delivered, masked, coalesced). And yes, the
>>>>>>> callback will be invoked in the context of the irq handler, so before
>>>>>>> qemu_set_irq (or rather some new qemu_set_irq_msg) returns.
>>>>>>>
>>>>>> Looks fine to me.
>>>>>>
>>>>> Except that delivery_cb should probably get pointer to qemu_irq_msg as a
>>>>> parameter.
>>>> I'd like to also support EOI handling. When the guest clears the
>>>> interrupt condtion, the EOI callback would be called. This could occur
>>>> much later than the IRQ delivery time. I'm not sure if we need the
>>>> result code in that case.
>>>>
>>>> If any intermediate device (IOAPIC?) needs to be informed about either
>>>> delivery or EOI also, it could create a proxy message with its
>>>> callbacks in place. But we need then a separate opaque field (in
>>>> addition to payload) to store the original message.
>>>>
>>>> struct IRQMsg {
>>>>  DeviceState *src;
>>>>  void (*delivery_cb)(IRQMsg *msg, int result);
>>>>  void (*eoi_cb)(IRQMsg *msg, int result);
>>>>  void *src_opaque;
>>>>  void *payload;
>>>> };
>>> Extending the lifetime of IRQMsg objects beyond the delivery call stack
>>> means qemu_malloc/free for every delivery. I think it takes a _very_
>>> appealing reason to justify this. But so far I do not see any use case
>>> for eio_cb at all.
>>
>> I think it's safer to use allocation model anyway because this will be
>> generic code. For example, an intermediate device may want to queue
>> the IRQs. Alternatively, the callbacks could use DeviceState and some
>> opaque which can be used as the callback context:
>>   void (*delivery_cb)(DeviceState *src, void *src_opaque, int result);
>>
>> EOI can be added later if needed, QEMU seems to work fine now without
>> it. But based on IOAPIC data sheet, I'd suppose it should be need to
>> pass EOI from LAPIC to IOAPIC. It could be used by coalescing as
>> another opportunity to inject IRQs though I guess the guest will ack
>> the IRQ at the same time for both RTC and APIC.
>
> Let's wait for a real use case for an extended IRQMsg lifetime. For now
> we are fine with stack-allocated messages which are way simpler to
> handle. I'm already drafting a first prototype based on this model.
> Switching to dynamic allocation may still happen later on once the
> urgent need shows up.

Passing around stack allocated objects is asking for trouble. I'd much
rather use the DeviceState/opaque version then, so at least
destination should not need to use IRQMsg for anything.
Jan Kiszka - June 5, 2010, 12:14 p.m.
Blue Swirl wrote:
> On Sat, Jun 5, 2010 at 8:27 AM, Jan Kiszka <jan.kiszka@web.de> wrote:
>> Blue Swirl wrote:
>>> On Sat, Jun 5, 2010 at 12:04 AM, Jan Kiszka <jan.kiszka@web.de> wrote:
>>>> Blue Swirl wrote:
>>>>> On Thu, Jun 3, 2010 at 7:06 AM, Gleb Natapov <gleb@redhat.com> wrote:
>>>>>> On Thu, Jun 03, 2010 at 10:03:00AM +0300, Gleb Natapov wrote:
>>>>>>> On Thu, Jun 03, 2010 at 08:59:23AM +0200, Jan Kiszka wrote:
>>>>>>>> Gleb Natapov wrote:
>>>>>>>>> On Thu, Jun 03, 2010 at 08:23:46AM +0200, Jan Kiszka wrote:
>>>>>>>>>> Blue Swirl wrote:
>>>>>>>>>>> But how about if we introduced instead a message based IRQ? Then the
>>>>>>>>>>> message could specify the originator device, maybe ACK/coalesce/NACK
>>>>>>>>>>> callbacks and a bigger payload than just 1 bit of level. I think that
>>>>>>>>>>> could achieve the same coalescing effect as what the bidirectional
>>>>>>>>>>> IRQ. The payload could be useful for other purposes, for example
>>>>>>>>>>> Sparc64 IRQ messages contain three 64 bit words.
>>>>>>>>>> If there are more users than just IRQ de-coalescing, this indeed sounds
>>>>>>>>>> superior. We could pass objects like this one around:
>>>>>>>>>>
>>>>>>>>>> struct qemu_irq_msg {
>>>>>>>>>>  void (*delivery_cb)(int result);
>>>>>>>>>>  void *payload;
>>>>>>>>>> };
>>>>>>>>>>
>>>>>>>>>> They would be valid within the scope of the IRQ handlers. Whoever
>>>>>>>>>> terminates or actually delivers the IRQ would invoke the callback. And
>>>>>>>>>> platforms like sparc64 could evaluate the additional payload pointer in
>>>>>>>>>> their irqchips or wherever they need it. IRQ routers on platforms that
>>>>>>>>>> make use of these messages would have to replicate them when forwarding
>>>>>>>>>> an event.
>>>>>>>>>>
>>>>>>>>>> OK?
>>>>>>>>>>
>>>>>>>>> Let me see if I understand you correctly. qemu_set_irq() will get
>>>>>>>>> additional parameter qemu_irq_msg and if irq was not coalesced
>>>>>>>>> delivery_cb is called, so there is a guaranty that if delivery_cb is
>>>>>>>>> called it is done before qemu_set_irq() returns. Correct?
>>>>>>>> If the side that triggers an IRQ passes a message object with a non-NULL
>>>>>>>> callback, it is supposed to be called unconditionally, passing the
>>>>>>>> result of the delivery (delivered, masked, coalesced). And yes, the
>>>>>>>> callback will be invoked in the context of the irq handler, so before
>>>>>>>> qemu_set_irq (or rather some new qemu_set_irq_msg) returns.
>>>>>>>>
>>>>>>> Looks fine to me.
>>>>>>>
>>>>>> Except that delivery_cb should probably get pointer to qemu_irq_msg as a
>>>>>> parameter.
>>>>> I'd like to also support EOI handling. When the guest clears the
>>>>> interrupt condtion, the EOI callback would be called. This could occur
>>>>> much later than the IRQ delivery time. I'm not sure if we need the
>>>>> result code in that case.
>>>>>
>>>>> If any intermediate device (IOAPIC?) needs to be informed about either
>>>>> delivery or EOI also, it could create a proxy message with its
>>>>> callbacks in place. But we need then a separate opaque field (in
>>>>> addition to payload) to store the original message.
>>>>>
>>>>> struct IRQMsg {
>>>>>  DeviceState *src;
>>>>>  void (*delivery_cb)(IRQMsg *msg, int result);
>>>>>  void (*eoi_cb)(IRQMsg *msg, int result);
>>>>>  void *src_opaque;
>>>>>  void *payload;
>>>>> };
>>>> Extending the lifetime of IRQMsg objects beyond the delivery call stack
>>>> means qemu_malloc/free for every delivery. I think it takes a _very_
>>>> appealing reason to justify this. But so far I do not see any use case
>>>> for eio_cb at all.
>>> I think it's safer to use allocation model anyway because this will be
>>> generic code. For example, an intermediate device may want to queue
>>> the IRQs. Alternatively, the callbacks could use DeviceState and some
>>> opaque which can be used as the callback context:
>>>   void (*delivery_cb)(DeviceState *src, void *src_opaque, int result);
>>>
>>> EOI can be added later if needed, QEMU seems to work fine now without
>>> it. But based on IOAPIC data sheet, I'd suppose it should be need to
>>> pass EOI from LAPIC to IOAPIC. It could be used by coalescing as
>>> another opportunity to inject IRQs though I guess the guest will ack
>>> the IRQ at the same time for both RTC and APIC.
>> Let's wait for a real use case for an extended IRQMsg lifetime. For now
>> we are fine with stack-allocated messages which are way simpler to
>> handle. I'm already drafting a first prototype based on this model.
>> Switching to dynamic allocation may still happen later on once the
>> urgent need shows up.
> 
> Passing around stack allocated objects is asking for trouble. I'd much
> rather use the DeviceState/opaque version then, so at least
> destination should not need to use IRQMsg for anything.

Right, I've hiden the IRQMsg object from the target handler, temporarily
storing it in qemu_irq instead. qemu_irq_handler had to be touched
anyway, so I'm now passing the IRQ object to it so that it can invoke
services to trigger the delivery callback or obtain the payload.

Jan
Gleb Natapov - June 6, 2010, 7:15 a.m.
On Sat, Jun 05, 2010 at 02:04:01AM +0200, Jan Kiszka wrote:
> > I'd like to also support EOI handling. When the guest clears the
> > interrupt condtion, the EOI callback would be called. This could occur
> > much later than the IRQ delivery time. I'm not sure if we need the
> > result code in that case.
> > 
> > If any intermediate device (IOAPIC?) needs to be informed about either
> > delivery or EOI also, it could create a proxy message with its
> > callbacks in place. But we need then a separate opaque field (in
> > addition to payload) to store the original message.
> > 
> > struct IRQMsg {
> >  DeviceState *src;
> >  void (*delivery_cb)(IRQMsg *msg, int result);
> >  void (*eoi_cb)(IRQMsg *msg, int result);
> >  void *src_opaque;
> >  void *payload;
> > };
> 
> Extending the lifetime of IRQMsg objects beyond the delivery call stack
> means qemu_malloc/free for every delivery. I think it takes a _very_
> appealing reason to justify this. But so far I do not see any use case
> for eio_cb at all.
> 
I dislike use of eoi for reinfecting missing interrupts since
it eliminates use of internal PIC/APIC queue of not yet delivered
interrupts. PIC and APIC has internal queue that can handle two elements:
one is delivered, but not yet acked interrupt in isr and another is
pending interrupt in irr. Using eoi callback (or ack notifier as it's
called inside kernel) interrupt will be considered coalesced even if irr
is cleared, but no ack was received for previously delivered interrupt.
But ack notifiers actually has another use: device assignment. There is
a plan to move device assignment from kernel to userspace and for that
ack notifiers will have to be extended to userspace too. If so we can
use them to do irq decoalescing as well. I doubt they should be part
of IRQMsg though. Why not do what kernel does: have globally registered
notifier based on irqchip/pin.

--
			Gleb.
Jan Kiszka - June 6, 2010, 7:39 a.m.
Gleb Natapov wrote:
> On Sat, Jun 05, 2010 at 02:04:01AM +0200, Jan Kiszka wrote:
>>> I'd like to also support EOI handling. When the guest clears the
>>> interrupt condtion, the EOI callback would be called. This could occur
>>> much later than the IRQ delivery time. I'm not sure if we need the
>>> result code in that case.
>>>
>>> If any intermediate device (IOAPIC?) needs to be informed about either
>>> delivery or EOI also, it could create a proxy message with its
>>> callbacks in place. But we need then a separate opaque field (in
>>> addition to payload) to store the original message.
>>>
>>> struct IRQMsg {
>>>  DeviceState *src;
>>>  void (*delivery_cb)(IRQMsg *msg, int result);
>>>  void (*eoi_cb)(IRQMsg *msg, int result);
>>>  void *src_opaque;
>>>  void *payload;
>>> };
>> Extending the lifetime of IRQMsg objects beyond the delivery call stack
>> means qemu_malloc/free for every delivery. I think it takes a _very_
>> appealing reason to justify this. But so far I do not see any use case
>> for eio_cb at all.
>>
> I dislike use of eoi for reinfecting missing interrupts since
> it eliminates use of internal PIC/APIC queue of not yet delivered
> interrupts. PIC and APIC has internal queue that can handle two elements:
> one is delivered, but not yet acked interrupt in isr and another is
> pending interrupt in irr. Using eoi callback (or ack notifier as it's
> called inside kernel) interrupt will be considered coalesced even if irr
> is cleared, but no ack was received for previously delivered interrupt.
> But ack notifiers actually has another use: device assignment. There is
> a plan to move device assignment from kernel to userspace and for that
> ack notifiers will have to be extended to userspace too. If so we can
> use them to do irq decoalescing as well. I doubt they should be part
> of IRQMsg though. Why not do what kernel does: have globally registered
> notifier based on irqchip/pin.

I read this twice but I still don't get your plan. Do you like or
dislike using EIO for de-coalescing? And how should these notifiers work?

Jan
Blue Swirl - June 6, 2010, 7:39 a.m.
On Sun, Jun 6, 2010 at 7:15 AM, Gleb Natapov <gleb@redhat.com> wrote:
> On Sat, Jun 05, 2010 at 02:04:01AM +0200, Jan Kiszka wrote:
>> > I'd like to also support EOI handling. When the guest clears the
>> > interrupt condtion, the EOI callback would be called. This could occur
>> > much later than the IRQ delivery time. I'm not sure if we need the
>> > result code in that case.
>> >
>> > If any intermediate device (IOAPIC?) needs to be informed about either
>> > delivery or EOI also, it could create a proxy message with its
>> > callbacks in place. But we need then a separate opaque field (in
>> > addition to payload) to store the original message.
>> >
>> > struct IRQMsg {
>> >  DeviceState *src;
>> >  void (*delivery_cb)(IRQMsg *msg, int result);
>> >  void (*eoi_cb)(IRQMsg *msg, int result);
>> >  void *src_opaque;
>> >  void *payload;
>> > };
>>
>> Extending the lifetime of IRQMsg objects beyond the delivery call stack
>> means qemu_malloc/free for every delivery. I think it takes a _very_
>> appealing reason to justify this. But so far I do not see any use case
>> for eio_cb at all.
>>
> I dislike use of eoi for reinfecting missing interrupts since
> it eliminates use of internal PIC/APIC queue of not yet delivered
> interrupts. PIC and APIC has internal queue that can handle two elements:
> one is delivered, but not yet acked interrupt in isr and another is
> pending interrupt in irr. Using eoi callback (or ack notifier as it's
> called inside kernel) interrupt will be considered coalesced even if irr
> is cleared, but no ack was received for previously delivered interrupt.
> But ack notifiers actually has another use: device assignment. There is
> a plan to move device assignment from kernel to userspace and for that
> ack notifiers will have to be extended to userspace too. If so we can
> use them to do irq decoalescing as well. I doubt they should be part
> of IRQMsg though. Why not do what kernel does: have globally registered
> notifier based on irqchip/pin.

Because translation at IOAPIC may be lossy, IRQs from many devices
pointing to the same vector? With IRQMsg you know where a specific
message came from. The situation is different inside the kernel: it
manages both translation and registration, whereas in QEMU we could
only control registration.
Gleb Natapov - June 6, 2010, 7:49 a.m.
On Sun, Jun 06, 2010 at 09:39:04AM +0200, Jan Kiszka wrote:
> Gleb Natapov wrote:
> > On Sat, Jun 05, 2010 at 02:04:01AM +0200, Jan Kiszka wrote:
> >>> I'd like to also support EOI handling. When the guest clears the
> >>> interrupt condtion, the EOI callback would be called. This could occur
> >>> much later than the IRQ delivery time. I'm not sure if we need the
> >>> result code in that case.
> >>>
> >>> If any intermediate device (IOAPIC?) needs to be informed about either
> >>> delivery or EOI also, it could create a proxy message with its
> >>> callbacks in place. But we need then a separate opaque field (in
> >>> addition to payload) to store the original message.
> >>>
> >>> struct IRQMsg {
> >>>  DeviceState *src;
> >>>  void (*delivery_cb)(IRQMsg *msg, int result);
> >>>  void (*eoi_cb)(IRQMsg *msg, int result);
> >>>  void *src_opaque;
> >>>  void *payload;
> >>> };
> >> Extending the lifetime of IRQMsg objects beyond the delivery call stack
> >> means qemu_malloc/free for every delivery. I think it takes a _very_
> >> appealing reason to justify this. But so far I do not see any use case
> >> for eio_cb at all.
> >>
> > I dislike use of eoi for reinfecting missing interrupts since
> > it eliminates use of internal PIC/APIC queue of not yet delivered
> > interrupts. PIC and APIC has internal queue that can handle two elements:
> > one is delivered, but not yet acked interrupt in isr and another is
> > pending interrupt in irr. Using eoi callback (or ack notifier as it's
> > called inside kernel) interrupt will be considered coalesced even if irr
> > is cleared, but no ack was received for previously delivered interrupt.
> > But ack notifiers actually has another use: device assignment. There is
> > a plan to move device assignment from kernel to userspace and for that
> > ack notifiers will have to be extended to userspace too. If so we can
> > use them to do irq decoalescing as well. I doubt they should be part
> > of IRQMsg though. Why not do what kernel does: have globally registered
> > notifier based on irqchip/pin.
> 
> I read this twice but I still don't get your plan. Do you like or
> dislike using EIO for de-coalescing? And how should these notifiers work?
> 
That's because I confused myself :) I _dislike_ them to be used, but
since device assignment requires ack notifiers anyway may be it is better
to introduce one mechanism for device assignmen + de-coalescing instead
of introducing two different mechanism. Using ack notifiers should be
easy: RTC registers ack notifier and keep track of delivered interrupts.
If timer triggers after previews irq was set, but before it was acked
coalesced counter is incremented. In ack notifier callback coalesced
counter is checked and if it is not zero new irq is set.

--
			Gleb.
Gleb Natapov - June 6, 2010, 8:07 a.m.
On Sun, Jun 06, 2010 at 07:39:49AM +0000, Blue Swirl wrote:
> On Sun, Jun 6, 2010 at 7:15 AM, Gleb Natapov <gleb@redhat.com> wrote:
> > On Sat, Jun 05, 2010 at 02:04:01AM +0200, Jan Kiszka wrote:
> >> > I'd like to also support EOI handling. When the guest clears the
> >> > interrupt condtion, the EOI callback would be called. This could occur
> >> > much later than the IRQ delivery time. I'm not sure if we need the
> >> > result code in that case.
> >> >
> >> > If any intermediate device (IOAPIC?) needs to be informed about either
> >> > delivery or EOI also, it could create a proxy message with its
> >> > callbacks in place. But we need then a separate opaque field (in
> >> > addition to payload) to store the original message.
> >> >
> >> > struct IRQMsg {
> >> >  DeviceState *src;
> >> >  void (*delivery_cb)(IRQMsg *msg, int result);
> >> >  void (*eoi_cb)(IRQMsg *msg, int result);
> >> >  void *src_opaque;
> >> >  void *payload;
> >> > };
> >>
> >> Extending the lifetime of IRQMsg objects beyond the delivery call stack
> >> means qemu_malloc/free for every delivery. I think it takes a _very_
> >> appealing reason to justify this. But so far I do not see any use case
> >> for eio_cb at all.
> >>
> > I dislike use of eoi for reinfecting missing interrupts since
> > it eliminates use of internal PIC/APIC queue of not yet delivered
> > interrupts. PIC and APIC has internal queue that can handle two elements:
> > one is delivered, but not yet acked interrupt in isr and another is
> > pending interrupt in irr. Using eoi callback (or ack notifier as it's
> > called inside kernel) interrupt will be considered coalesced even if irr
> > is cleared, but no ack was received for previously delivered interrupt.
> > But ack notifiers actually has another use: device assignment. There is
> > a plan to move device assignment from kernel to userspace and for that
> > ack notifiers will have to be extended to userspace too. If so we can
> > use them to do irq decoalescing as well. I doubt they should be part
> > of IRQMsg though. Why not do what kernel does: have globally registered
> > notifier based on irqchip/pin.
> 
> Because translation at IOAPIC may be lossy, IRQs from many devices
> pointing to the same vector? With IRQMsg you know where a specific
> message came from. The situation is different inside the kernel: it
> manages both translation and registration, whereas in QEMU we could
> only control registration.
Configuring IOAPIC like that is against x86 architecture. OS will not be
able to map from interrupt vector back to device.

--
			Gleb.
Jan Kiszka - June 6, 2010, 8:07 a.m.
Gleb Natapov wrote:
> On Sun, Jun 06, 2010 at 09:39:04AM +0200, Jan Kiszka wrote:
>> Gleb Natapov wrote:
>>> On Sat, Jun 05, 2010 at 02:04:01AM +0200, Jan Kiszka wrote:
>>>>> I'd like to also support EOI handling. When the guest clears the
>>>>> interrupt condtion, the EOI callback would be called. This could occur
>>>>> much later than the IRQ delivery time. I'm not sure if we need the
>>>>> result code in that case.
>>>>>
>>>>> If any intermediate device (IOAPIC?) needs to be informed about either
>>>>> delivery or EOI also, it could create a proxy message with its
>>>>> callbacks in place. But we need then a separate opaque field (in
>>>>> addition to payload) to store the original message.
>>>>>
>>>>> struct IRQMsg {
>>>>>  DeviceState *src;
>>>>>  void (*delivery_cb)(IRQMsg *msg, int result);
>>>>>  void (*eoi_cb)(IRQMsg *msg, int result);
>>>>>  void *src_opaque;
>>>>>  void *payload;
>>>>> };
>>>> Extending the lifetime of IRQMsg objects beyond the delivery call stack
>>>> means qemu_malloc/free for every delivery. I think it takes a _very_
>>>> appealing reason to justify this. But so far I do not see any use case
>>>> for eio_cb at all.
>>>>
>>> I dislike use of eoi for reinfecting missing interrupts since
>>> it eliminates use of internal PIC/APIC queue of not yet delivered
>>> interrupts. PIC and APIC has internal queue that can handle two elements:
>>> one is delivered, but not yet acked interrupt in isr and another is
>>> pending interrupt in irr. Using eoi callback (or ack notifier as it's
>>> called inside kernel) interrupt will be considered coalesced even if irr
>>> is cleared, but no ack was received for previously delivered interrupt.
>>> But ack notifiers actually has another use: device assignment. There is
>>> a plan to move device assignment from kernel to userspace and for that
>>> ack notifiers will have to be extended to userspace too. If so we can
>>> use them to do irq decoalescing as well. I doubt they should be part
>>> of IRQMsg though. Why not do what kernel does: have globally registered
>>> notifier based on irqchip/pin.
>> I read this twice but I still don't get your plan. Do you like or
>> dislike using EIO for de-coalescing? And how should these notifiers work?
>>
> That's because I confused myself :) I _dislike_ them to be used, but
> since device assignment requires ack notifiers anyway may be it is better
> to introduce one mechanism for device assignmen + de-coalescing instead
> of introducing two different mechanism. Using ack notifiers should be
> easy: RTC registers ack notifier and keep track of delivered interrupts.
> If timer triggers after previews irq was set, but before it was acked
> coalesced counter is incremented. In ack notifier callback coalesced
> counter is checked and if it is not zero new irq is set.

Ack notifier registrations and event deliveries still need to be routed.
Piggy-backing this on IRQ messages may be unavoidable for that reason.

Anyway, I'm going to post my HPET updates with the infrastructure for
IRQMsg now. Maybe it's helpful to see the other option in reality.

Jan
Gleb Natapov - June 6, 2010, 9:23 a.m.
On Sun, Jun 06, 2010 at 10:07:48AM +0200, Jan Kiszka wrote:
> Gleb Natapov wrote:
> > On Sun, Jun 06, 2010 at 09:39:04AM +0200, Jan Kiszka wrote:
> >> Gleb Natapov wrote:
> >>> On Sat, Jun 05, 2010 at 02:04:01AM +0200, Jan Kiszka wrote:
> >>>>> I'd like to also support EOI handling. When the guest clears the
> >>>>> interrupt condtion, the EOI callback would be called. This could occur
> >>>>> much later than the IRQ delivery time. I'm not sure if we need the
> >>>>> result code in that case.
> >>>>>
> >>>>> If any intermediate device (IOAPIC?) needs to be informed about either
> >>>>> delivery or EOI also, it could create a proxy message with its
> >>>>> callbacks in place. But we need then a separate opaque field (in
> >>>>> addition to payload) to store the original message.
> >>>>>
> >>>>> struct IRQMsg {
> >>>>>  DeviceState *src;
> >>>>>  void (*delivery_cb)(IRQMsg *msg, int result);
> >>>>>  void (*eoi_cb)(IRQMsg *msg, int result);
> >>>>>  void *src_opaque;
> >>>>>  void *payload;
> >>>>> };
> >>>> Extending the lifetime of IRQMsg objects beyond the delivery call stack
> >>>> means qemu_malloc/free for every delivery. I think it takes a _very_
> >>>> appealing reason to justify this. But so far I do not see any use case
> >>>> for eio_cb at all.
> >>>>
> >>> I dislike use of eoi for reinfecting missing interrupts since
> >>> it eliminates use of internal PIC/APIC queue of not yet delivered
> >>> interrupts. PIC and APIC has internal queue that can handle two elements:
> >>> one is delivered, but not yet acked interrupt in isr and another is
> >>> pending interrupt in irr. Using eoi callback (or ack notifier as it's
> >>> called inside kernel) interrupt will be considered coalesced even if irr
> >>> is cleared, but no ack was received for previously delivered interrupt.
> >>> But ack notifiers actually has another use: device assignment. There is
> >>> a plan to move device assignment from kernel to userspace and for that
> >>> ack notifiers will have to be extended to userspace too. If so we can
> >>> use them to do irq decoalescing as well. I doubt they should be part
> >>> of IRQMsg though. Why not do what kernel does: have globally registered
> >>> notifier based on irqchip/pin.
> >> I read this twice but I still don't get your plan. Do you like or
> >> dislike using EIO for de-coalescing? And how should these notifiers work?
> >>
> > That's because I confused myself :) I _dislike_ them to be used, but
> > since device assignment requires ack notifiers anyway may be it is better
> > to introduce one mechanism for device assignmen + de-coalescing instead
> > of introducing two different mechanism. Using ack notifiers should be
> > easy: RTC registers ack notifier and keep track of delivered interrupts.
> > If timer triggers after previews irq was set, but before it was acked
> > coalesced counter is incremented. In ack notifier callback coalesced
> > counter is checked and if it is not zero new irq is set.
> 
> Ack notifier registrations and event deliveries still need to be routed.
> Piggy-backing this on IRQ messages may be unavoidable for that reason.
It is done in the kernel without piggy-backing.

> 
> Anyway, I'm going to post my HPET updates with the infrastructure for
> IRQMsg now. Maybe it's helpful to see the other option in reality.
> 
One other think to consider current approach does not always work.
Win2K3-64bit-smp and Win2k8-64bit-smp configure RTC interrupt to be
broadcasted to all cpus, but only boot cpu does time calculation. With
current approach if interrupt is delivered to at least one vcpu
it will not be considered coalesced, but if cpu it was delivered to is
not cpu that does time accounting then clock will drift.

--
			Gleb.
Jan Kiszka - June 6, 2010, 10:10 a.m.
Gleb Natapov wrote:
> On Sun, Jun 06, 2010 at 10:07:48AM +0200, Jan Kiszka wrote:
>> Gleb Natapov wrote:
>>> On Sun, Jun 06, 2010 at 09:39:04AM +0200, Jan Kiszka wrote:
>>>> Gleb Natapov wrote:
>>>>> On Sat, Jun 05, 2010 at 02:04:01AM +0200, Jan Kiszka wrote:
>>>>>>> I'd like to also support EOI handling. When the guest clears the
>>>>>>> interrupt condtion, the EOI callback would be called. This could occur
>>>>>>> much later than the IRQ delivery time. I'm not sure if we need the
>>>>>>> result code in that case.
>>>>>>>
>>>>>>> If any intermediate device (IOAPIC?) needs to be informed about either
>>>>>>> delivery or EOI also, it could create a proxy message with its
>>>>>>> callbacks in place. But we need then a separate opaque field (in
>>>>>>> addition to payload) to store the original message.
>>>>>>>
>>>>>>> struct IRQMsg {
>>>>>>>  DeviceState *src;
>>>>>>>  void (*delivery_cb)(IRQMsg *msg, int result);
>>>>>>>  void (*eoi_cb)(IRQMsg *msg, int result);
>>>>>>>  void *src_opaque;
>>>>>>>  void *payload;
>>>>>>> };
>>>>>> Extending the lifetime of IRQMsg objects beyond the delivery call stack
>>>>>> means qemu_malloc/free for every delivery. I think it takes a _very_
>>>>>> appealing reason to justify this. But so far I do not see any use case
>>>>>> for eio_cb at all.
>>>>>>
>>>>> I dislike use of eoi for reinfecting missing interrupts since
>>>>> it eliminates use of internal PIC/APIC queue of not yet delivered
>>>>> interrupts. PIC and APIC has internal queue that can handle two elements:
>>>>> one is delivered, but not yet acked interrupt in isr and another is
>>>>> pending interrupt in irr. Using eoi callback (or ack notifier as it's
>>>>> called inside kernel) interrupt will be considered coalesced even if irr
>>>>> is cleared, but no ack was received for previously delivered interrupt.
>>>>> But ack notifiers actually has another use: device assignment. There is
>>>>> a plan to move device assignment from kernel to userspace and for that
>>>>> ack notifiers will have to be extended to userspace too. If so we can
>>>>> use them to do irq decoalescing as well. I doubt they should be part
>>>>> of IRQMsg though. Why not do what kernel does: have globally registered
>>>>> notifier based on irqchip/pin.
>>>> I read this twice but I still don't get your plan. Do you like or
>>>> dislike using EIO for de-coalescing? And how should these notifiers work?
>>>>
>>> That's because I confused myself :) I _dislike_ them to be used, but
>>> since device assignment requires ack notifiers anyway may be it is better
>>> to introduce one mechanism for device assignmen + de-coalescing instead
>>> of introducing two different mechanism. Using ack notifiers should be
>>> easy: RTC registers ack notifier and keep track of delivered interrupts.
>>> If timer triggers after previews irq was set, but before it was acked
>>> coalesced counter is incremented. In ack notifier callback coalesced
>>> counter is checked and if it is not zero new irq is set.
>> Ack notifier registrations and event deliveries still need to be routed.
>> Piggy-backing this on IRQ messages may be unavoidable for that reason.
> It is done in the kernel without piggy-backing.

As it does not include any IRQ routers in front of the interrupt
controller. Maybe it works for x86, but it is no generic solution.

Also, periodic timer sources get no information about the fact that
their interrupt is masked somewhere along the path to the VCPUs and will
possibly replay countless IRQs when the masking ends, no?

> 
>> Anyway, I'm going to post my HPET updates with the infrastructure for
>> IRQMsg now. Maybe it's helpful to see the other option in reality.
>>
> One other think to consider current approach does not always work.
> Win2K3-64bit-smp and Win2k8-64bit-smp configure RTC interrupt to be
> broadcasted to all cpus, but only boot cpu does time calculation. With
> current approach if interrupt is delivered to at least one vcpu
> it will not be considered coalesced, but if cpu it was delivered to is
> not cpu that does time accounting then clock will drift.

That means we would have to fire callbacks per receiving CPU and report
its number back. Is there a way to find out if we are running such a
guest without an '-enable-win2k[38]-64bit-smp-rtc-drift-fix'?

Jan
Gleb Natapov - June 6, 2010, 10:27 a.m.
On Sun, Jun 06, 2010 at 12:10:07PM +0200, Jan Kiszka wrote:
> Gleb Natapov wrote:
> > On Sun, Jun 06, 2010 at 10:07:48AM +0200, Jan Kiszka wrote:
> >> Gleb Natapov wrote:
> >>> On Sun, Jun 06, 2010 at 09:39:04AM +0200, Jan Kiszka wrote:
> >>>> Gleb Natapov wrote:
> >>>>> On Sat, Jun 05, 2010 at 02:04:01AM +0200, Jan Kiszka wrote:
> >>>>>>> I'd like to also support EOI handling. When the guest clears the
> >>>>>>> interrupt condtion, the EOI callback would be called. This could occur
> >>>>>>> much later than the IRQ delivery time. I'm not sure if we need the
> >>>>>>> result code in that case.
> >>>>>>>
> >>>>>>> If any intermediate device (IOAPIC?) needs to be informed about either
> >>>>>>> delivery or EOI also, it could create a proxy message with its
> >>>>>>> callbacks in place. But we need then a separate opaque field (in
> >>>>>>> addition to payload) to store the original message.
> >>>>>>>
> >>>>>>> struct IRQMsg {
> >>>>>>>  DeviceState *src;
> >>>>>>>  void (*delivery_cb)(IRQMsg *msg, int result);
> >>>>>>>  void (*eoi_cb)(IRQMsg *msg, int result);
> >>>>>>>  void *src_opaque;
> >>>>>>>  void *payload;
> >>>>>>> };
> >>>>>> Extending the lifetime of IRQMsg objects beyond the delivery call stack
> >>>>>> means qemu_malloc/free for every delivery. I think it takes a _very_
> >>>>>> appealing reason to justify this. But so far I do not see any use case
> >>>>>> for eio_cb at all.
> >>>>>>
> >>>>> I dislike use of eoi for reinfecting missing interrupts since
> >>>>> it eliminates use of internal PIC/APIC queue of not yet delivered
> >>>>> interrupts. PIC and APIC has internal queue that can handle two elements:
> >>>>> one is delivered, but not yet acked interrupt in isr and another is
> >>>>> pending interrupt in irr. Using eoi callback (or ack notifier as it's
> >>>>> called inside kernel) interrupt will be considered coalesced even if irr
> >>>>> is cleared, but no ack was received for previously delivered interrupt.
> >>>>> But ack notifiers actually has another use: device assignment. There is
> >>>>> a plan to move device assignment from kernel to userspace and for that
> >>>>> ack notifiers will have to be extended to userspace too. If so we can
> >>>>> use them to do irq decoalescing as well. I doubt they should be part
> >>>>> of IRQMsg though. Why not do what kernel does: have globally registered
> >>>>> notifier based on irqchip/pin.
> >>>> I read this twice but I still don't get your plan. Do you like or
> >>>> dislike using EIO for de-coalescing? And how should these notifiers work?
> >>>>
> >>> That's because I confused myself :) I _dislike_ them to be used, but
> >>> since device assignment requires ack notifiers anyway may be it is better
> >>> to introduce one mechanism for device assignmen + de-coalescing instead
> >>> of introducing two different mechanism. Using ack notifiers should be
> >>> easy: RTC registers ack notifier and keep track of delivered interrupts.
> >>> If timer triggers after previews irq was set, but before it was acked
> >>> coalesced counter is incremented. In ack notifier callback coalesced
> >>> counter is checked and if it is not zero new irq is set.
> >> Ack notifier registrations and event deliveries still need to be routed.
> >> Piggy-backing this on IRQ messages may be unavoidable for that reason.
> > It is done in the kernel without piggy-backing.
> 
> As it does not include any IRQ routers in front of the interrupt
> controller. Maybe it works for x86, but it is no generic solution.
> 
x86 has IRQ router in front of interrupt controller inside pci host
bridge.

> Also, periodic timer sources get no information about the fact that
> their interrupt is masked somewhere along the path to the VCPUs and will
> possibly replay countless IRQs when the masking ends, no?
> 
Correct, for that we have mask notifiers in the kernel. Gets ugly be the
minute.

> > 
> >> Anyway, I'm going to post my HPET updates with the infrastructure for
> >> IRQMsg now. Maybe it's helpful to see the other option in reality.
> >>
> > One other think to consider current approach does not always work.
> > Win2K3-64bit-smp and Win2k8-64bit-smp configure RTC interrupt to be
> > broadcasted to all cpus, but only boot cpu does time calculation. With
> > current approach if interrupt is delivered to at least one vcpu
> > it will not be considered coalesced, but if cpu it was delivered to is
> > not cpu that does time accounting then clock will drift.
> 
> That means we would have to fire callbacks per receiving CPU and report
> its number back. Is there a way to find out if we are running such a
> guest without an '-enable-win2k[38]-64bit-smp-rtc-drift-fix'?
> 
Not that I know of.

--
			Gleb.

Patch

diff --git a/hw/irq.c b/hw/irq.c
index 7703f62..db2cce6 100644
--- a/hw/irq.c
+++ b/hw/irq.c
@@ -26,19 +26,27 @@ 
 
 struct IRQState {
     qemu_irq_handler handler;
+    qemu_irq_fb_handler feedback_handler;
     void *opaque;
     int n;
 };
 
-void qemu_set_irq(qemu_irq irq, int level)
+int qemu_set_irq(qemu_irq irq, int level)
 {
-    if (!irq)
-        return;
-
-    irq->handler(irq->opaque, irq->n, level);
+    if (!irq) {
+        return 0;
+    }
+    if (irq->feedback_handler) {
+        return irq->feedback_handler(irq->opaque, irq->n, level);
+    } else {
+        irq->handler(irq->opaque, irq->n, level);
+        return QEMU_IRQ_DELIVERED;
+    }
 }
 
-qemu_irq *qemu_allocate_irqs(qemu_irq_handler handler, void *opaque, int n)
+static qemu_irq *allocate_irqs(qemu_irq_handler handler,
+                               qemu_irq_fb_handler feedback_handler,
+                               void *opaque, int n)
 {
     qemu_irq *s;
     struct IRQState *p;
@@ -48,6 +56,7 @@  qemu_irq *qemu_allocate_irqs(qemu_irq_handler handler, void *opaque, int n)
     p = (struct IRQState *)qemu_mallocz(sizeof(struct IRQState) * n);
     for (i = 0; i < n; i++) {
         p->handler = handler;
+        p->feedback_handler = feedback_handler;
         p->opaque = opaque;
         p->n = i;
         s[i] = p;
@@ -56,22 +65,33 @@  qemu_irq *qemu_allocate_irqs(qemu_irq_handler handler, void *opaque, int n)
     return s;
 }
 
+qemu_irq *qemu_allocate_irqs(qemu_irq_handler handler, void *opaque, int n)
+{
+    return allocate_irqs(handler, NULL, opaque, n);
+}
+
+qemu_irq *qemu_allocate_feedback_irqs(qemu_irq_fb_handler handler,
+                                      void *opaque, int n)
+{
+    return allocate_irqs(NULL, handler, opaque, n);
+}
+
 void qemu_free_irqs(qemu_irq *s)
 {
     qemu_free(s[0]);
     qemu_free(s);
 }
 
-static void qemu_notirq(void *opaque, int line, int level)
+static int qemu_notirq(void *opaque, int line, int level)
 {
     struct IRQState *irq = opaque;
 
-    irq->handler(irq->opaque, irq->n, !level);
+    return qemu_set_irq(irq, !level);
 }
 
 qemu_irq qemu_irq_invert(qemu_irq irq)
 {
     /* The default state for IRQs is low, so raise the output now.  */
     qemu_irq_raise(irq);
-    return qemu_allocate_irqs(qemu_notirq, irq, 1)[0];
+    return allocate_irqs(NULL, qemu_notirq, irq, 1)[0];
 }
diff --git a/hw/irq.h b/hw/irq.h
index 5daae44..eee03e6 100644
--- a/hw/irq.h
+++ b/hw/irq.h
@@ -3,15 +3,18 @@ 
 
 /* Generic IRQ/GPIO pin infrastructure.  */
 
-/* FIXME: Rmove one of these.  */
+#define QEMU_IRQ_DELIVERED      0
+#define QEMU_IRQ_COALESCED      (-1)
+#define QEMU_IRQ_MASKED         (-2)
+
 typedef void (*qemu_irq_handler)(void *opaque, int n, int level);
-typedef void SetIRQFunc(void *opaque, int irq_num, int level);
+typedef int (*qemu_irq_fb_handler)(void *opaque, int n, int level);
 
-void qemu_set_irq(qemu_irq irq, int level);
+int qemu_set_irq(qemu_irq irq, int level);
 
-static inline void qemu_irq_raise(qemu_irq irq)
+static inline int qemu_irq_raise(qemu_irq irq)
 {
-    qemu_set_irq(irq, 1);
+    return qemu_set_irq(irq, 1);
 }
 
 static inline void qemu_irq_lower(qemu_irq irq)
@@ -19,14 +22,19 @@  static inline void qemu_irq_lower(qemu_irq irq)
     qemu_set_irq(irq, 0);
 }
 
-static inline void qemu_irq_pulse(qemu_irq irq)
+static inline int qemu_irq_pulse(qemu_irq irq)
 {
-    qemu_set_irq(irq, 1);
+    int ret;
+
+    ret = qemu_set_irq(irq, 1);
     qemu_set_irq(irq, 0);
+    return ret;
 }
 
 /* Returns an array of N IRQs.  */
 qemu_irq *qemu_allocate_irqs(qemu_irq_handler handler, void *opaque, int n);
+qemu_irq *qemu_allocate_feedback_irqs(qemu_irq_fb_handler handler,
+                                      void *opaque, int n);
 void qemu_free_irqs(qemu_irq *s);
 
 /* Returns a new IRQ with opposite polarity.  */