diff mbox

[RFC] e1000: Don't save writes to ICS/ICR masked by IMS

Message ID bcfd0ba9-2763-421a-da9c-5a4993855301@skyportsystems.com
State New
Headers show

Commit Message

Ed Swierk Sept. 1, 2016, 5:57 p.m. UTC
Windows 8, 10 and Server 2012 guests hang intermittently while booting
on Xen 4.5.3 with 1 vCPU and 4 e1000 vNICs, shortly after the Windows
logo appears and the little dots start spinning.

Running strace on qemu shows its main thread doing the following every
couple of milliseconds:

 ppoll([..., {fd=30</dev/xen/evtchn>, events=POLLIN|POLLERR|POLLHUP},
        ...], ...) = 1 ([{fd=30, revents=POLLIN}], ...)
 read(30</dev/xen/evtchn>, "^\0\0\0", 4) = 4
 write(30</dev/xen/evtchn>, "^\0\0\0", 4) = 4
 ioctl(30</dev/xen/evtchn>, IOCTL_EVTCHN_NOTIFY, 0x7f1f9449d310) = 0
 clock_gettime(CLOCK_MONOTONIC, {6937, 449468262}) = 0
 clock_gettime(CLOCK_MONOTONIC, {6937, 449582903}) = 0
 gettimeofday({1472251376, 673434}, NULL) = 0
 clock_gettime(CLOCK_MONOTONIC, {6937, 449856205}) = 0
 gettimeofday({1472251376, 673679}, NULL) = 0

The event channel (identified by '^' or 94 in this example) is always
the third of the domain's four channels.

Two recent qemu patches (http://git.qemu.org/?p=qemu.git;h=9596ef7c and
http://git.qemu.org/?p=qemu.git;h=74004e8c) seem to address similar
issues, but don't help in this case.

The proposed fix from
https://bugzilla.redhat.com/show_bug.cgi?id=874406#c78 makes the hang
go away. It's not clear to me why it works, or if it's just papering
over a bug elsewhere, or if there are any possible side effects.

Suggested-by: Andrew Jones <drjones@redhat.com>
Signed-off-by: Ed Swierk <eswierk@skyportsystems.com>

Comments

Konrad Rzeszutek Wilk Sept. 13, 2016, 8:59 p.m. UTC | #1
On Thu, Sep 01, 2016 at 10:57:48AM -0700, Ed Swierk wrote:
> Windows 8, 10 and Server 2012 guests hang intermittently while booting
> on Xen 4.5.3 with 1 vCPU and 4 e1000 vNICs, shortly after the Windows
> logo appears and the little dots start spinning.
> 
> Running strace on qemu shows its main thread doing the following every
> couple of milliseconds:
> 
>  ppoll([..., {fd=30</dev/xen/evtchn>, events=POLLIN|POLLERR|POLLHUP},
>         ...], ...) = 1 ([{fd=30, revents=POLLIN}], ...)
>  read(30</dev/xen/evtchn>, "^\0\0\0", 4) = 4
>  write(30</dev/xen/evtchn>, "^\0\0\0", 4) = 4
>  ioctl(30</dev/xen/evtchn>, IOCTL_EVTCHN_NOTIFY, 0x7f1f9449d310) = 0
>  clock_gettime(CLOCK_MONOTONIC, {6937, 449468262}) = 0
>  clock_gettime(CLOCK_MONOTONIC, {6937, 449582903}) = 0
>  gettimeofday({1472251376, 673434}, NULL) = 0
>  clock_gettime(CLOCK_MONOTONIC, {6937, 449856205}) = 0
>  gettimeofday({1472251376, 673679}, NULL) = 0
> 
> The event channel (identified by '^' or 94 in this example) is always
> the third of the domain's four channels.
> 
> Two recent qemu patches (http://git.qemu.org/?p=qemu.git;h=9596ef7c and
> http://git.qemu.org/?p=qemu.git;h=74004e8c) seem to address similar
> issues, but don't help in this case.
> 
> The proposed fix from
> https://bugzilla.redhat.com/show_bug.cgi?id=874406#c78 makes the hang
> go away. It's not clear to me why it works, or if it's just papering
> over a bug elsewhere, or if there are any possible side effects.

CC-ing Denis.


Is the fix below based on reading the spec or more of instrumenting?

Thanks.
> 
> Suggested-by: Andrew Jones <drjones@redhat.com>
> Signed-off-by: Ed Swierk <eswierk@skyportsystems.com>
> 
> diff --git a/hw/net/e1000.c b/hw/net/e1000.c
> index 6eac66d..c891b67 100644
> --- a/hw/net/e1000.c
> +++ b/hw/net/e1000.c
> @@ -293,6 +293,8 @@ set_interrupt_cause(E1000State *s, int index, uint32_t val)
>      uint32_t pending_ints;
>      uint32_t mit_delay;
>  
> +    val &= s->mac_reg[IMS];
> +
>      s->mac_reg[ICR] = val;
>  
>      /*
> @@ -305,7 +307,7 @@ set_interrupt_cause(E1000State *s, int index, uint32_t val)
>       */
>      s->mac_reg[ICS] = val;
>  
> -    pending_ints = (s->mac_reg[IMS] & s->mac_reg[ICR]);
> +    pending_ints = s->mac_reg[ICR];
>      if (!s->mit_irq_level && pending_ints) {
>          /*
>           * Here we detect a potential raising edge. We postpone raising the
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> https://lists.xen.org/xen-devel
Denis V. Lunev Sept. 15, 2016, 9:15 a.m. UTC | #2
On 09/13/2016 11:59 PM, Konrad Rzeszutek Wilk wrote:
> On Thu, Sep 01, 2016 at 10:57:48AM -0700, Ed Swierk wrote:
>> Windows 8, 10 and Server 2012 guests hang intermittently while booting
>> on Xen 4.5.3 with 1 vCPU and 4 e1000 vNICs, shortly after the Windows
>> logo appears and the little dots start spinning.
>>
>> Running strace on qemu shows its main thread doing the following every
>> couple of milliseconds:
>>
>>  ppoll([..., {fd=30</dev/xen/evtchn>, events=POLLIN|POLLERR|POLLHUP},
>>         ...], ...) = 1 ([{fd=30, revents=POLLIN}], ...)
>>  read(30</dev/xen/evtchn>, "^\0\0\0", 4) = 4
>>  write(30</dev/xen/evtchn>, "^\0\0\0", 4) = 4
>>  ioctl(30</dev/xen/evtchn>, IOCTL_EVTCHN_NOTIFY, 0x7f1f9449d310) = 0
>>  clock_gettime(CLOCK_MONOTONIC, {6937, 449468262}) = 0
>>  clock_gettime(CLOCK_MONOTONIC, {6937, 449582903}) = 0
>>  gettimeofday({1472251376, 673434}, NULL) = 0
>>  clock_gettime(CLOCK_MONOTONIC, {6937, 449856205}) = 0
>>  gettimeofday({1472251376, 673679}, NULL) = 0
>>
>> The event channel (identified by '^' or 94 in this example) is always
>> the third of the domain's four channels.
>>
>> Two recent qemu patches (http://git.qemu.org/?p=qemu.git;h=9596ef7c and
>> http://git.qemu.org/?p=qemu.git;h=74004e8c) seem to address similar
>> issues, but don't help in this case.
>>
>> The proposed fix from
>> https://bugzilla.redhat.com/show_bug.cgi?id=874406#c78 makes the hang
>> go away. It's not clear to me why it works, or if it's just papering
>> over a bug elsewhere, or if there are any possible side effects.
> CC-ing Denis.
>
>
> Is the fix below based on reading the spec or more of instrumenting?
>
> Thanks.

hmm. I have looked in our older code (completely separate
from QEMU). It does not have this trick.

2012r2 (as far as I remember) has a bug in the driver
when LSC interrupt was raised unexpectedly during driver
initialization. The original bug was about TXQE. Can you
pls confirm which interrupt causes the storm?

Den
Ed Swierk Sept. 15, 2016, 12:22 p.m. UTC | #3
On Thu, Sep 15, 2016 at 2:15 AM, Denis V. Lunev <den@openvz.org> wrote:
> On 09/13/2016 11:59 PM, Konrad Rzeszutek Wilk wrote:
> > On Thu, Sep 01, 2016 at 10:57:48AM -0700, Ed Swierk wrote:
> >> Windows 8, 10 and Server 2012 guests hang intermittently while booting
> >> on Xen 4.5.3 with 1 vCPU and 4 e1000 vNICs, shortly after the Windows
> >> logo appears and the little dots start spinning.
> >>
> >> Running strace on qemu shows its main thread doing the following every
> >> couple of milliseconds:
> >>
> >>  ppoll([..., {fd=30</dev/xen/evtchn>, events=POLLIN|POLLERR|POLLHUP},
> >>         ...], ...) = 1 ([{fd=30, revents=POLLIN}], ...)
> >>  read(30</dev/xen/evtchn>, "^\0\0\0", 4) = 4
> >>  write(30</dev/xen/evtchn>, "^\0\0\0", 4) = 4
> >>  ioctl(30</dev/xen/evtchn>, IOCTL_EVTCHN_NOTIFY, 0x7f1f9449d310) = 0
> >>  clock_gettime(CLOCK_MONOTONIC, {6937, 449468262}) = 0
> >>  clock_gettime(CLOCK_MONOTONIC, {6937, 449582903}) = 0
> >>  gettimeofday({1472251376, 673434}, NULL) = 0
> >>  clock_gettime(CLOCK_MONOTONIC, {6937, 449856205}) = 0
> >>  gettimeofday({1472251376, 673679}, NULL) = 0
> >>
> >> The event channel (identified by '^' or 94 in this example) is always
> >> the third of the domain's four channels.
> >>
> >> Two recent qemu patches (http://git.qemu.org/?p=qemu.git;h=9596ef7c and
> >> http://git.qemu.org/?p=qemu.git;h=74004e8c) seem to address similar
> >> issues, but don't help in this case.
> >>
> >> The proposed fix from
> >> https://bugzilla.redhat.com/show_bug.cgi?id=874406#c78 makes the hang
> >> go away. It's not clear to me why it works, or if it's just papering
> >> over a bug elsewhere, or if there are any possible side effects.
> > CC-ing Denis.
> >
> >
> > Is the fix below based on reading the spec or more of instrumenting?
> >
> > Thanks.
>
> hmm. I have looked in our older code (completely separate
> from QEMU). It does not have this trick.
>
> 2012r2 (as far as I remember) has a bug in the driver
> when LSC interrupt was raised unexpectedly during driver
> initialization. The original bug was about TXQE. Can you
> pls confirm which interrupt causes the storm?

How can I tell which interrupt is firing? Instrument the QEMU e1000 code?

--Ed
Denis V. Lunev Sept. 15, 2016, 12:46 p.m. UTC | #4
On 09/15/2016 03:22 PM, Ed Swierk wrote:
> On Thu, Sep 15, 2016 at 2:15 AM, Denis V. Lunev <den@openvz.org> wrote:
>> On 09/13/2016 11:59 PM, Konrad Rzeszutek Wilk wrote:
>>> On Thu, Sep 01, 2016 at 10:57:48AM -0700, Ed Swierk wrote:
>>>> Windows 8, 10 and Server 2012 guests hang intermittently while booting
>>>> on Xen 4.5.3 with 1 vCPU and 4 e1000 vNICs, shortly after the Windows
>>>> logo appears and the little dots start spinning.
>>>>
>>>> Running strace on qemu shows its main thread doing the following every
>>>> couple of milliseconds:
>>>>
>>>>  ppoll([..., {fd=30</dev/xen/evtchn>, events=POLLIN|POLLERR|POLLHUP},
>>>>         ...], ...) = 1 ([{fd=30, revents=POLLIN}], ...)
>>>>  read(30</dev/xen/evtchn>, "^\0\0\0", 4) = 4
>>>>  write(30</dev/xen/evtchn>, "^\0\0\0", 4) = 4
>>>>  ioctl(30</dev/xen/evtchn>, IOCTL_EVTCHN_NOTIFY, 0x7f1f9449d310) = 0
>>>>  clock_gettime(CLOCK_MONOTONIC, {6937, 449468262}) = 0
>>>>  clock_gettime(CLOCK_MONOTONIC, {6937, 449582903}) = 0
>>>>  gettimeofday({1472251376, 673434}, NULL) = 0
>>>>  clock_gettime(CLOCK_MONOTONIC, {6937, 449856205}) = 0
>>>>  gettimeofday({1472251376, 673679}, NULL) = 0
>>>>
>>>> The event channel (identified by '^' or 94 in this example) is always
>>>> the third of the domain's four channels.
>>>>
>>>> Two recent qemu patches (http://git.qemu.org/?p=qemu.git;h=9596ef7c and
>>>> http://git.qemu.org/?p=qemu.git;h=74004e8c) seem to address similar
>>>> issues, but don't help in this case.
>>>>
>>>> The proposed fix from
>>>> https://bugzilla.redhat.com/show_bug.cgi?id=874406#c78 makes the hang
>>>> go away. It's not clear to me why it works, or if it's just papering
>>>> over a bug elsewhere, or if there are any possible side effects.
>>> CC-ing Denis.
>>>
>>>
>>> Is the fix below based on reading the spec or more of instrumenting?
>>>
>>> Thanks.
>> hmm. I have looked in our older code (completely separate
>> from QEMU). It does not have this trick.
>>
>> 2012r2 (as far as I remember) has a bug in the driver
>> when LSC interrupt was raised unexpectedly during driver
>> initialization. The original bug was about TXQE. Can you
>> pls confirm which interrupt causes the storm?
> How can I tell which interrupt is firing? Instrument the QEMU e1000 code?
>
> --Ed
>
connect with gdb to qemu, set break into set_interrupt_cause
and dump value of  s->mac_reg[ICS], s->mac_reg[IMS] and s->mac_reg[ICR]

Bits are defined here hw/net/e1000_regs.h

/* Interrupt Cause Read */
#define E1000_ICR_TXDW          0x00000001 /* Transmit desc written back */
#define E1000_ICR_TXQE          0x00000002 /* Transmit Queue empty */

Den
diff mbox

Patch

diff --git a/hw/net/e1000.c b/hw/net/e1000.c
index 6eac66d..c891b67 100644
--- a/hw/net/e1000.c
+++ b/hw/net/e1000.c
@@ -293,6 +293,8 @@  set_interrupt_cause(E1000State *s, int index, uint32_t val)
     uint32_t pending_ints;
     uint32_t mit_delay;
 
+    val &= s->mac_reg[IMS];
+
     s->mac_reg[ICR] = val;
 
     /*
@@ -305,7 +307,7 @@  set_interrupt_cause(E1000State *s, int index, uint32_t val)
      */
     s->mac_reg[ICS] = val;
 
-    pending_ints = (s->mac_reg[IMS] & s->mac_reg[ICR]);
+    pending_ints = s->mac_reg[ICR];
     if (!s->mit_irq_level && pending_ints) {
         /*
          * Here we detect a potential raising edge. We postpone raising the