Message ID | bcfd0ba9-2763-421a-da9c-5a4993855301@skyportsystems.com |
---|---|
State | New |
Headers | show |
On Thu, Sep 01, 2016 at 10:57:48AM -0700, Ed Swierk wrote: > Windows 8, 10 and Server 2012 guests hang intermittently while booting > on Xen 4.5.3 with 1 vCPU and 4 e1000 vNICs, shortly after the Windows > logo appears and the little dots start spinning. > > Running strace on qemu shows its main thread doing the following every > couple of milliseconds: > > ppoll([..., {fd=30</dev/xen/evtchn>, events=POLLIN|POLLERR|POLLHUP}, > ...], ...) = 1 ([{fd=30, revents=POLLIN}], ...) > read(30</dev/xen/evtchn>, "^\0\0\0", 4) = 4 > write(30</dev/xen/evtchn>, "^\0\0\0", 4) = 4 > ioctl(30</dev/xen/evtchn>, IOCTL_EVTCHN_NOTIFY, 0x7f1f9449d310) = 0 > clock_gettime(CLOCK_MONOTONIC, {6937, 449468262}) = 0 > clock_gettime(CLOCK_MONOTONIC, {6937, 449582903}) = 0 > gettimeofday({1472251376, 673434}, NULL) = 0 > clock_gettime(CLOCK_MONOTONIC, {6937, 449856205}) = 0 > gettimeofday({1472251376, 673679}, NULL) = 0 > > The event channel (identified by '^' or 94 in this example) is always > the third of the domain's four channels. > > Two recent qemu patches (http://git.qemu.org/?p=qemu.git;h=9596ef7c and > http://git.qemu.org/?p=qemu.git;h=74004e8c) seem to address similar > issues, but don't help in this case. > > The proposed fix from > https://bugzilla.redhat.com/show_bug.cgi?id=874406#c78 makes the hang > go away. It's not clear to me why it works, or if it's just papering > over a bug elsewhere, or if there are any possible side effects. CC-ing Denis. Is the fix below based on reading the spec or more of instrumenting? Thanks. > > Suggested-by: Andrew Jones <drjones@redhat.com> > Signed-off-by: Ed Swierk <eswierk@skyportsystems.com> > > diff --git a/hw/net/e1000.c b/hw/net/e1000.c > index 6eac66d..c891b67 100644 > --- a/hw/net/e1000.c > +++ b/hw/net/e1000.c > @@ -293,6 +293,8 @@ set_interrupt_cause(E1000State *s, int index, uint32_t val) > uint32_t pending_ints; > uint32_t mit_delay; > > + val &= s->mac_reg[IMS]; > + > s->mac_reg[ICR] = val; > > /* > @@ -305,7 +307,7 @@ set_interrupt_cause(E1000State *s, int index, uint32_t val) > */ > s->mac_reg[ICS] = val; > > - pending_ints = (s->mac_reg[IMS] & s->mac_reg[ICR]); > + pending_ints = s->mac_reg[ICR]; > if (!s->mit_irq_level && pending_ints) { > /* > * Here we detect a potential raising edge. We postpone raising the > > _______________________________________________ > Xen-devel mailing list > Xen-devel@lists.xen.org > https://lists.xen.org/xen-devel
On 09/13/2016 11:59 PM, Konrad Rzeszutek Wilk wrote: > On Thu, Sep 01, 2016 at 10:57:48AM -0700, Ed Swierk wrote: >> Windows 8, 10 and Server 2012 guests hang intermittently while booting >> on Xen 4.5.3 with 1 vCPU and 4 e1000 vNICs, shortly after the Windows >> logo appears and the little dots start spinning. >> >> Running strace on qemu shows its main thread doing the following every >> couple of milliseconds: >> >> ppoll([..., {fd=30</dev/xen/evtchn>, events=POLLIN|POLLERR|POLLHUP}, >> ...], ...) = 1 ([{fd=30, revents=POLLIN}], ...) >> read(30</dev/xen/evtchn>, "^\0\0\0", 4) = 4 >> write(30</dev/xen/evtchn>, "^\0\0\0", 4) = 4 >> ioctl(30</dev/xen/evtchn>, IOCTL_EVTCHN_NOTIFY, 0x7f1f9449d310) = 0 >> clock_gettime(CLOCK_MONOTONIC, {6937, 449468262}) = 0 >> clock_gettime(CLOCK_MONOTONIC, {6937, 449582903}) = 0 >> gettimeofday({1472251376, 673434}, NULL) = 0 >> clock_gettime(CLOCK_MONOTONIC, {6937, 449856205}) = 0 >> gettimeofday({1472251376, 673679}, NULL) = 0 >> >> The event channel (identified by '^' or 94 in this example) is always >> the third of the domain's four channels. >> >> Two recent qemu patches (http://git.qemu.org/?p=qemu.git;h=9596ef7c and >> http://git.qemu.org/?p=qemu.git;h=74004e8c) seem to address similar >> issues, but don't help in this case. >> >> The proposed fix from >> https://bugzilla.redhat.com/show_bug.cgi?id=874406#c78 makes the hang >> go away. It's not clear to me why it works, or if it's just papering >> over a bug elsewhere, or if there are any possible side effects. > CC-ing Denis. > > > Is the fix below based on reading the spec or more of instrumenting? > > Thanks. hmm. I have looked in our older code (completely separate from QEMU). It does not have this trick. 2012r2 (as far as I remember) has a bug in the driver when LSC interrupt was raised unexpectedly during driver initialization. The original bug was about TXQE. Can you pls confirm which interrupt causes the storm? Den
On Thu, Sep 15, 2016 at 2:15 AM, Denis V. Lunev <den@openvz.org> wrote: > On 09/13/2016 11:59 PM, Konrad Rzeszutek Wilk wrote: > > On Thu, Sep 01, 2016 at 10:57:48AM -0700, Ed Swierk wrote: > >> Windows 8, 10 and Server 2012 guests hang intermittently while booting > >> on Xen 4.5.3 with 1 vCPU and 4 e1000 vNICs, shortly after the Windows > >> logo appears and the little dots start spinning. > >> > >> Running strace on qemu shows its main thread doing the following every > >> couple of milliseconds: > >> > >> ppoll([..., {fd=30</dev/xen/evtchn>, events=POLLIN|POLLERR|POLLHUP}, > >> ...], ...) = 1 ([{fd=30, revents=POLLIN}], ...) > >> read(30</dev/xen/evtchn>, "^\0\0\0", 4) = 4 > >> write(30</dev/xen/evtchn>, "^\0\0\0", 4) = 4 > >> ioctl(30</dev/xen/evtchn>, IOCTL_EVTCHN_NOTIFY, 0x7f1f9449d310) = 0 > >> clock_gettime(CLOCK_MONOTONIC, {6937, 449468262}) = 0 > >> clock_gettime(CLOCK_MONOTONIC, {6937, 449582903}) = 0 > >> gettimeofday({1472251376, 673434}, NULL) = 0 > >> clock_gettime(CLOCK_MONOTONIC, {6937, 449856205}) = 0 > >> gettimeofday({1472251376, 673679}, NULL) = 0 > >> > >> The event channel (identified by '^' or 94 in this example) is always > >> the third of the domain's four channels. > >> > >> Two recent qemu patches (http://git.qemu.org/?p=qemu.git;h=9596ef7c and > >> http://git.qemu.org/?p=qemu.git;h=74004e8c) seem to address similar > >> issues, but don't help in this case. > >> > >> The proposed fix from > >> https://bugzilla.redhat.com/show_bug.cgi?id=874406#c78 makes the hang > >> go away. It's not clear to me why it works, or if it's just papering > >> over a bug elsewhere, or if there are any possible side effects. > > CC-ing Denis. > > > > > > Is the fix below based on reading the spec or more of instrumenting? > > > > Thanks. > > hmm. I have looked in our older code (completely separate > from QEMU). It does not have this trick. > > 2012r2 (as far as I remember) has a bug in the driver > when LSC interrupt was raised unexpectedly during driver > initialization. The original bug was about TXQE. Can you > pls confirm which interrupt causes the storm? How can I tell which interrupt is firing? Instrument the QEMU e1000 code? --Ed
On 09/15/2016 03:22 PM, Ed Swierk wrote: > On Thu, Sep 15, 2016 at 2:15 AM, Denis V. Lunev <den@openvz.org> wrote: >> On 09/13/2016 11:59 PM, Konrad Rzeszutek Wilk wrote: >>> On Thu, Sep 01, 2016 at 10:57:48AM -0700, Ed Swierk wrote: >>>> Windows 8, 10 and Server 2012 guests hang intermittently while booting >>>> on Xen 4.5.3 with 1 vCPU and 4 e1000 vNICs, shortly after the Windows >>>> logo appears and the little dots start spinning. >>>> >>>> Running strace on qemu shows its main thread doing the following every >>>> couple of milliseconds: >>>> >>>> ppoll([..., {fd=30</dev/xen/evtchn>, events=POLLIN|POLLERR|POLLHUP}, >>>> ...], ...) = 1 ([{fd=30, revents=POLLIN}], ...) >>>> read(30</dev/xen/evtchn>, "^\0\0\0", 4) = 4 >>>> write(30</dev/xen/evtchn>, "^\0\0\0", 4) = 4 >>>> ioctl(30</dev/xen/evtchn>, IOCTL_EVTCHN_NOTIFY, 0x7f1f9449d310) = 0 >>>> clock_gettime(CLOCK_MONOTONIC, {6937, 449468262}) = 0 >>>> clock_gettime(CLOCK_MONOTONIC, {6937, 449582903}) = 0 >>>> gettimeofday({1472251376, 673434}, NULL) = 0 >>>> clock_gettime(CLOCK_MONOTONIC, {6937, 449856205}) = 0 >>>> gettimeofday({1472251376, 673679}, NULL) = 0 >>>> >>>> The event channel (identified by '^' or 94 in this example) is always >>>> the third of the domain's four channels. >>>> >>>> Two recent qemu patches (http://git.qemu.org/?p=qemu.git;h=9596ef7c and >>>> http://git.qemu.org/?p=qemu.git;h=74004e8c) seem to address similar >>>> issues, but don't help in this case. >>>> >>>> The proposed fix from >>>> https://bugzilla.redhat.com/show_bug.cgi?id=874406#c78 makes the hang >>>> go away. It's not clear to me why it works, or if it's just papering >>>> over a bug elsewhere, or if there are any possible side effects. >>> CC-ing Denis. >>> >>> >>> Is the fix below based on reading the spec or more of instrumenting? >>> >>> Thanks. >> hmm. I have looked in our older code (completely separate >> from QEMU). It does not have this trick. >> >> 2012r2 (as far as I remember) has a bug in the driver >> when LSC interrupt was raised unexpectedly during driver >> initialization. The original bug was about TXQE. Can you >> pls confirm which interrupt causes the storm? > How can I tell which interrupt is firing? Instrument the QEMU e1000 code? > > --Ed > connect with gdb to qemu, set break into set_interrupt_cause and dump value of s->mac_reg[ICS], s->mac_reg[IMS] and s->mac_reg[ICR] Bits are defined here hw/net/e1000_regs.h /* Interrupt Cause Read */ #define E1000_ICR_TXDW 0x00000001 /* Transmit desc written back */ #define E1000_ICR_TXQE 0x00000002 /* Transmit Queue empty */ Den
diff --git a/hw/net/e1000.c b/hw/net/e1000.c index 6eac66d..c891b67 100644 --- a/hw/net/e1000.c +++ b/hw/net/e1000.c @@ -293,6 +293,8 @@ set_interrupt_cause(E1000State *s, int index, uint32_t val) uint32_t pending_ints; uint32_t mit_delay; + val &= s->mac_reg[IMS]; + s->mac_reg[ICR] = val; /* @@ -305,7 +307,7 @@ set_interrupt_cause(E1000State *s, int index, uint32_t val) */ s->mac_reg[ICS] = val; - pending_ints = (s->mac_reg[IMS] & s->mac_reg[ICR]); + pending_ints = s->mac_reg[ICR]; if (!s->mit_irq_level && pending_ints) { /* * Here we detect a potential raising edge. We postpone raising the
Windows 8, 10 and Server 2012 guests hang intermittently while booting on Xen 4.5.3 with 1 vCPU and 4 e1000 vNICs, shortly after the Windows logo appears and the little dots start spinning. Running strace on qemu shows its main thread doing the following every couple of milliseconds: ppoll([..., {fd=30</dev/xen/evtchn>, events=POLLIN|POLLERR|POLLHUP}, ...], ...) = 1 ([{fd=30, revents=POLLIN}], ...) read(30</dev/xen/evtchn>, "^\0\0\0", 4) = 4 write(30</dev/xen/evtchn>, "^\0\0\0", 4) = 4 ioctl(30</dev/xen/evtchn>, IOCTL_EVTCHN_NOTIFY, 0x7f1f9449d310) = 0 clock_gettime(CLOCK_MONOTONIC, {6937, 449468262}) = 0 clock_gettime(CLOCK_MONOTONIC, {6937, 449582903}) = 0 gettimeofday({1472251376, 673434}, NULL) = 0 clock_gettime(CLOCK_MONOTONIC, {6937, 449856205}) = 0 gettimeofday({1472251376, 673679}, NULL) = 0 The event channel (identified by '^' or 94 in this example) is always the third of the domain's four channels. Two recent qemu patches (http://git.qemu.org/?p=qemu.git;h=9596ef7c and http://git.qemu.org/?p=qemu.git;h=74004e8c) seem to address similar issues, but don't help in this case. The proposed fix from https://bugzilla.redhat.com/show_bug.cgi?id=874406#c78 makes the hang go away. It's not clear to me why it works, or if it's just papering over a bug elsewhere, or if there are any possible side effects. Suggested-by: Andrew Jones <drjones@redhat.com> Signed-off-by: Ed Swierk <eswierk@skyportsystems.com>