
[2.6.30-rc4] r8169: avoid losing MSI interrupts

Message ID 200905231124.28925.mb@bu3sch.de
State Not Applicable, archived
Delegated to: David Miller

Commit Message

Michael Buesch May 23, 2009, 9:24 a.m. UTC
On Saturday 23 May 2009 03:29:34 David Dillow wrote:
> The 8169 chip only generates MSI interrupts when all enabled event
> sources are quiescent and one or more sources transition to active. If
> not all of the active events are acknowledged, or a new event becomes
> active while the existing ones are cleared in the handler, we will not
> see a new interrupt.
> 
> The current interrupt handler masks off the Rx and Tx events once the
> NAPI handler has been scheduled, which opens a race window in which we
> can get another Rx or Tx event that is never ACK'd, stopping all
> activity until the link is reset (ifconfig down/up). Fix this by always
> ACK'ing all event sources, and by looping in the handler until all
> sources are quiescent.
> 
> Signed-off-by: David Dillow <dave@thedillows.org>

Thanks a lot, Dave! This fixes the issue on my chip.
You can add my:

Tested-by: Michael Buesch <mb@bu3sch.de>



Here's a patch that cleanly applies to 2.6.29.4:

Comments

Michael Riepe May 23, 2009, 2:35 p.m. UTC | #1
Hi!

Michael Buesch wrote:

> Thanks a lot, Dave! This fixes the issue on my chip.

Yep, it's stable here as well. And even a little faster than pci=nomsi.
The only strangeness I observed is that the throughput (measured with
iperf and a single TCP connection) varies:

[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec    667 MBytes    559 Mbits/sec
[  3] 10.0-20.0 sec    803 MBytes    673 Mbits/sec
[  3] 20.0-30.0 sec    802 MBytes    673 Mbits/sec
[  3] 30.0-40.0 sec    714 MBytes    599 Mbits/sec
[  3] 40.0-50.0 sec    669 MBytes    561 Mbits/sec
[  3] 50.0-60.0 sec    791 MBytes    663 Mbits/sec
[  3]  0.0-60.0 sec  4.34 GBytes    622 Mbits/sec

In gkrellm, you can see that it actually alternates between two values.

With pci=nomsi (and without the patch) I get lower but more consistent
results:

[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec    511 MBytes    429 Mbits/sec
[  3] 10.0-20.0 sec    465 MBytes    390 Mbits/sec
[  3] 20.0-30.0 sec    481 MBytes    404 Mbits/sec
[  3] 30.0-40.0 sec    466 MBytes    391 Mbits/sec
[  3] 40.0-50.0 sec    465 MBytes    390 Mbits/sec
[  3] 50.0-60.0 sec    463 MBytes    389 Mbits/sec
[  3]  0.0-60.0 sec  2.78 GBytes    399 Mbits/sec

I suppose it's a side effect of the MSI acknowledgement loop. But who am
I to complain about higher average throughput? ;-)

> You can add my:
> 
> Tested-by: Michael Buesch <mb@bu3sch.de>

Tested-by: Michael Riepe <michael.riepe@googlemail.com>
Michael Buesch May 23, 2009, 2:44 p.m. UTC | #2
On Saturday 23 May 2009 16:35:28 Michael Riepe wrote:
> Hi!
> 
> Michael Buesch wrote:
> 
> > Thanks a lot, Dave! This fixes the issue on my chip.
> 
> Yep, it's stable here as well. And even a little faster than pci=nomsi.
> The only strangeness I observed is that the throughput (measured with
> iperf and a single TCP connection) varies:
> 
> [ ID] Interval       Transfer     Bandwidth
> [  3]  0.0-10.0 sec    667 MBytes    559 Mbits/sec
> [  3] 10.0-20.0 sec    803 MBytes    673 Mbits/sec
> [  3] 20.0-30.0 sec    802 MBytes    673 Mbits/sec
> [  3] 30.0-40.0 sec    714 MBytes    599 Mbits/sec
> [  3] 40.0-50.0 sec    669 MBytes    561 Mbits/sec
> [  3] 50.0-60.0 sec    791 MBytes    663 Mbits/sec
> [  3]  0.0-60.0 sec  4.34 GBytes    622 Mbits/sec

Are you running the iperf server or client on the r8169?
I'm running the iperf server on the r8169 and get the following results:

mb@homer:~$ iperf -c 192.168.2.50 -t120 -i10
------------------------------------------------------------
Client connecting to 192.168.2.50, TCP port 5001
TCP window size: 16.0 KByte (default)
------------------------------------------------------------
[  3] local 192.168.2.205 port 34739 connected with 192.168.2.50 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec    789 MBytes    662 Mbits/sec
[  3] 10.0-20.0 sec    791 MBytes    664 Mbits/sec
[  3] 20.0-30.0 sec    799 MBytes    670 Mbits/sec
[  3] 30.0-40.0 sec    794 MBytes    666 Mbits/sec
[  3] 40.0-50.0 sec    795 MBytes    667 Mbits/sec
[  3] 50.0-60.0 sec    793 MBytes    665 Mbits/sec
[  3] 60.0-70.0 sec    800 MBytes    671 Mbits/sec
[  3] 70.0-80.0 sec    795 MBytes    667 Mbits/sec
[  3] 80.0-90.0 sec    800 MBytes    672 Mbits/sec
[  3] 90.0-100.0 sec    797 MBytes    669 Mbits/sec
[  3] 100.0-110.0 sec    789 MBytes    662 Mbits/sec
[  3] 110.0-120.0 sec    791 MBytes    663 Mbits/sec
[  3]  0.0-120.0 sec  9.31 GBytes    666 Mbits/sec

Looks acceptable to me.
The iperf client device is a BCM5780.
David Dillow May 23, 2009, 2:51 p.m. UTC | #3
On Sat, 2009-05-23 at 16:35 +0200, Michael Riepe wrote:
> Hi!
> 
> Michael Buesch wrote:
> 
> > Thanks a lot, Dave! This fixes the issue on my chip.
> 
> Yep, it's stable here as well. And even a little faster than pci=nomsi.
> The only strangeness I observed is that the throughput (measured with
> iperf and a single TCP connection) varies:
> 
> [ ID] Interval       Transfer     Bandwidth
> [  3]  0.0-10.0 sec    667 MBytes    559 Mbits/sec
> [  3] 10.0-20.0 sec    803 MBytes    673 Mbits/sec
> [  3] 20.0-30.0 sec    802 MBytes    673 Mbits/sec
> [  3] 30.0-40.0 sec    714 MBytes    599 Mbits/sec
> [  3] 40.0-50.0 sec    669 MBytes    561 Mbits/sec
> [  3] 50.0-60.0 sec    791 MBytes    663 Mbits/sec
> [  3]  0.0-60.0 sec  4.34 GBytes    622 Mbits/sec
> 
[snip]
> I suppose it's a side effect of the MSI acknowledgement loop. But who am
> I to complain about higher average throughput? ;-)

I wonder if that is the TCP sawtooth pattern -- ramp up until we drop
packets, back off, repeat. I thought newer congestion algorithms would
help with that, but I've not kept up; this may be another red herring --
like the bisection into genirq.

A tcpdump may answer the question -- wireshark can do an analysis and
see if it is running up until it starts dropping or something.
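Something along these lines would be enough for that kind of analysis
(assuming the r8169 is eth0 and iperf is on its default port 5001):

	# grab just the TCP headers during an iperf run, then load the file in wireshark
	tcpdump -i eth0 -s 128 -w iperf-run.pcap tcp port 5001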

Or it may be the loop, but I wouldn't expect it to make such a big
difference, or be as variable if it does.

Also, what does it look like with multiple streams?

Thanks for testing guys -- I'm glad it works for a sample size > 1!

Dave

Michael Riepe May 23, 2009, 3:01 p.m. UTC | #4
Hi!

Michael Buesch wrote:
> On Saturday 23 May 2009 16:35:28 Michael Riepe wrote:
> 
>>Hi!
>>
>>Michael Buesch wrote:
>>
>>
>>>Thanks a lot, Dave! This fixes the issue on my chip.
>>
>>Yep, it's stable here as well. And even a little faster than pci=nomsi.
>>The only strangeness I observed is that the throughput (measured with
>>iperf and a single TCP connection) varies:
>>
>>[ ID] Interval       Transfer     Bandwidth
>>[  3]  0.0-10.0 sec    667 MBytes    559 Mbits/sec
>>[  3] 10.0-20.0 sec    803 MBytes    673 Mbits/sec
>>[  3] 20.0-30.0 sec    802 MBytes    673 Mbits/sec
>>[  3] 30.0-40.0 sec    714 MBytes    599 Mbits/sec
>>[  3] 40.0-50.0 sec    669 MBytes    561 Mbits/sec
>>[  3] 50.0-60.0 sec    791 MBytes    663 Mbits/sec
>>[  3]  0.0-60.0 sec  4.34 GBytes    622 Mbits/sec
> 
> 
> Are you running the iperf server or client on the r8169?

The client, since I wanted to measure write throughput. The server is a
Lenovo Thinkpad T60 (e1000e driver).

> I'm running the iperf server on the r8169 and get the following results:
> 
> mb@homer:~$ iperf -c 192.168.2.50 -t120 -i10

In that case, the r8169 is receiving data (which never was a problem, at
least for me - it only lost interrupts in TX mode).
Michael Riepe May 23, 2009, 4:12 p.m. UTC | #5
Hi!

David Dillow wrote:

> I wonder if that is the TCP sawtooth pattern -- run up until we drop
> packets, drop off, repeat. I thought newer congestion algorithms would
> help with that, but I've not kept up, this may be another red-herring --
> like the bisection into genirq.

Actually, I just found out that things are much stranger. A freshly
booted system (I'm using 2.6.29.2 + the r8169 patch sent by Michael
Buesch, by the way) behaves like this:

[  3] local 192.168.178.206 port 44090 connected with 192.168.178.204 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec    483 MBytes    405 Mbits/sec
[  3] 10.0-20.0 sec    472 MBytes    396 Mbits/sec
[  3] 20.0-30.0 sec    482 MBytes    404 Mbits/sec
[  3] 30.0-40.0 sec    483 MBytes    405 Mbits/sec
[  3] 40.0-50.0 sec    480 MBytes    402 Mbits/sec
[  3] 50.0-60.0 sec    479 MBytes    402 Mbits/sec
[  3]  0.0-60.0 sec  2.81 GBytes    402 Mbits/sec

Then I've been running another test, something along the lines of

	for dest in host1 host1 host2 host2
	do ssh $dest dd of=/dev/null bs=8k count=10240000 </dev/zero &
	done

After a while, I killed the ssh processes and ran iperf again. And this
time, I got:

[  3] local 192.168.178.206 port 58029 connected with 192.168.178.204 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec    634 MBytes    531 Mbits/sec
[  3] 10.0-20.0 sec    740 MBytes    621 Mbits/sec
[  3] 20.0-30.0 sec    641 MBytes    538 Mbits/sec
[  3] 30.0-40.0 sec    738 MBytes    619 Mbits/sec
[  3] 40.0-50.0 sec    742 MBytes    622 Mbits/sec
[  3] 50.0-60.0 sec    743 MBytes    623 Mbits/sec
[  3]  0.0-60.0 sec  4.14 GBytes    592 Mbits/sec

Obviously, the high-load ssh test (which would kill the device within a
few seconds without the patch) triggers something here.

A few observations later, however, I was convinced that it's not a TCP
congestion or driver issue. Actually, the throughput depends on the CPU
the benchmark is running on. You can see that in gkrellm - whenever the
process jumps to another CPU, the throughput changes. On the four
(virtual) CPUs of the Atom 330, I get these results:

CPU 0:  0.0-60.0 sec  2.65 GBytes    380 Mbits/sec
CPU 1:  0.0-60.0 sec  4.12 GBytes    590 Mbits/sec
CPU 2:  0.0-60.0 sec  3.79 GBytes    543 Mbits/sec
CPU 3:  0.0-60.0 sec  4.13 GBytes    592 Mbits/sec

CPU 0+2 are on the first core, 1+3 on the second.

If I use two connections (iperf -P2) and nail iperf to both threads of a
single core with taskset (the program is multi-threaded, just in case
you wonder), I get this:

CPU 0+2:  0.0-60.0 sec  4.65 GBytes    665 Mbits/sec
CPU 1+3:  0.0-60.0 sec  6.43 GBytes    920 Mbits/sec

That's quite a difference, isn't it?

Now I wonder what CPU 0 is doing...
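For reference, the pinning described above comes down to something like
this (the server address is just an example):

	# both iperf threads on the two HT siblings of core 0, then of core 1
	taskset -c 0,2 iperf -c 192.168.178.204 -P2 -t60 -i10
	taskset -c 1,3 iperf -c 192.168.178.204 -P2 -t60 -i10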
Michael Buesch May 23, 2009, 4:40 p.m. UTC | #6
On Saturday 23 May 2009 17:01:36 Michael Riepe wrote:
> The client, since I wanted to measure write throughput. The server is a
> Lenovo Thinkpad T60 (e1000e driver).
> 
> > I'm running the iperf server on the r8169 and get the following results:
> > 
> > mb@homer:~$ iperf -c 192.168.2.50 -t120 -i10
> 
> In that case, the r8169 is receiving data (which never was a problem, at
> least for me - it only lost interrupts in TX mode).

Ok, I tested the other way around and it shows the problem:

mb@quimby:~/kernel$ iperf -c 192.168.2.205 -t120 -i10
------------------------------------------------------------
Client connecting to 192.168.2.205, TCP port 5001
TCP window size: 16.0 KByte (default)
------------------------------------------------------------
[  3] local 192.168.2.50 port 34788 connected with 192.168.2.205 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec    501 MBytes    420 Mbits/sec
[  3] 10.0-20.0 sec    533 MBytes    447 Mbits/sec
[  3] 20.0-30.0 sec    699 MBytes    587 Mbits/sec
[  3] 30.0-40.0 sec    715 MBytes    600 Mbits/sec
[  3] 40.0-50.0 sec    469 MBytes    394 Mbits/sec
[  3] 50.0-60.0 sec    466 MBytes    391 Mbits/sec
[  3] 60.0-70.0 sec    464 MBytes    389 Mbits/sec
[  3] 70.0-80.0 sec    498 MBytes    418 Mbits/sec
[  3] 80.0-90.0 sec    477 MBytes    400 Mbits/sec
[  3] 90.0-100.0 sec    601 MBytes    504 Mbits/sec
[  3] 100.0-110.0 sec    696 MBytes    584 Mbits/sec
[  3] 110.0-120.0 sec    510 MBytes    428 Mbits/sec
[  3]  0.0-120.0 sec  6.47 GBytes    463 Mbits/sec
Michael Buesch May 23, 2009, 4:45 p.m. UTC | #7
On Saturday 23 May 2009 18:12:22 Michael Riepe wrote:

> CPU 0+2:  0.0-60.0 sec  4.65 GBytes    665 Mbits/sec
> CPU 1+3:  0.0-60.0 sec  6.43 GBytes    920 Mbits/sec
> 
> That's quite a difference, isn't it?
> 
> Now I wonder what CPU 0 is doing...

Is CPU0 the only one servicing interrupts?
David Dillow May 23, 2009, 4:46 p.m. UTC | #8
On Sat, 2009-05-23 at 18:12 +0200, Michael Riepe wrote:
> If I use two connections (iperf -P2) and nail iperf to both threads of a
> single core with taskset (the program is multi-threaded, just in case
> you wonder), I get this:
> 
> CPU 0+2:  0.0-60.0 sec  4.65 GBytes    665 Mbits/sec
> CPU 1+3:  0.0-60.0 sec  6.43 GBytes    920 Mbits/sec
> 
> That's quite a difference, isn't it?
> 
> Now I wonder what CPU 0 is doing...

Where does /proc/interrupts say the irqs are going?

Michael Buesch May 23, 2009, 4:50 p.m. UTC | #9
On Saturday 23 May 2009 18:46:13 David Dillow wrote:
> On Sat, 2009-05-23 at 18:12 +0200, Michael Riepe wrote:
> > If I use two connections (iperf -P2) and nail iperf to both threads of a
> > single core with taskset (the program is multi-threaded, just in case
> > you wonder), I get this:
> > 
> > CPU 0+2:  0.0-60.0 sec  4.65 GBytes    665 Mbits/sec
> > CPU 1+3:  0.0-60.0 sec  6.43 GBytes    920 Mbits/sec
> > 
> > That's quite a difference, isn't it?
> > 
> > Now I wonder what CPU 0 is doing...
> 
> Where does /proc/interrupts say the irqs are going?
> 
> 
> 

For me it looks like this:
 24:   52606581          0          0          0   PCI-MSI-edge      eth0

Since I have the same CPU as Michael Riepe, I think CPU0 (core 0) servicing
interrupts is related to the issue.

I'm wondering, however, whether this is expected behavior. iperf plus interrupt
handling will pretty much saturate one core of the Atom 330, so it looks like a
plain CPU bottleneck.
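One way to check that would be to move the MSI vector off CPU0 and re-run
iperf; the IRQ number is the one /proc/interrupts shows for eth0 (24 here,
27 on Michael Riepe's box):

	# route eth0's MSI to CPU1 (affinity bitmask 0x2), then re-check the counters
	echo 2 > /proc/irq/24/smp_affinity
	grep eth0 /proc/interrupts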
Michael Riepe May 23, 2009, 4:53 p.m. UTC | #10
David Dillow wrote:
> On Sat, 2009-05-23 at 18:12 +0200, Michael Riepe wrote:
> 
>>If I use two connections (iperf -P2) and nail iperf to both threads of a
>>single core with taskset (the program is multi-threaded, just in case
>>you wonder), I get this:
>>
>>CPU 0+2:  0.0-60.0 sec  4.65 GBytes    665 Mbits/sec
>>CPU 1+3:  0.0-60.0 sec  6.43 GBytes    920 Mbits/sec
>>
>>That's quite a difference, isn't it?
>>
>>Now I wonder what CPU 0 is doing...
> 
> 
> Where does /proc/interrupts say the irqs are going?

Oh well...

           CPU0       CPU1       CPU2       CPU3
  0:         66          0          0          0   IO-APIC-edge      timer
  1:          2          0          0          0   IO-APIC-edge      i8042
  7:          0          0          0          0   IO-APIC-edge      parport0
  8:          1          0          0          0   IO-APIC-edge      rtc0
  9:          0          0          0          0   IO-APIC-fasteoi   acpi
 12:          4          0          0          0   IO-APIC-edge      i8042
 14:          0          0          0          0   IO-APIC-edge      ata_piix
 15:          0          0          0          0   IO-APIC-edge      ata_piix
 16:          0          0          0          0   IO-APIC-fasteoi   uhci_hcd:usb4
 18:          0          0          0          0   IO-APIC-fasteoi   uhci_hcd:usb3
 19:       9696          0          0          0   IO-APIC-fasteoi   ata_piix, uhci_hcd:usb2
 22:        471          0          0          0   IO-APIC-fasteoi   HDA Intel
 23:        100          0          0          0   IO-APIC-fasteoi   uhci_hcd:usb1, ehci_hcd:usb5
 27:   48463995          0          0          0   PCI-MSI-edge      eth0
NMI:          0          0          0          0   Non-maskable interrupts
LOC:     460965     294620     324902     358771   Local timer interrupts
RES:      30111     286150     113295     268387   Rescheduling interrupts
CAL:       1140       1143         52         34   Function call interrupts
TLB:       2223        952       1184       2573   TLB shootdowns
SPU:          0          0          0          0   Spurious interrupts
ERR:          0
MIS:          0
David Dillow May 23, 2009, 5:03 p.m. UTC | #11
On Sat, 2009-05-23 at 18:53 +0200, Michael Riepe wrote:
> 
> David Dillow wrote:
> > On Sat, 2009-05-23 at 18:12 +0200, Michael Riepe wrote:
> > 
> >>If I use two connections (iperf -P2) and nail iperf to both threads of a
> >>single core with taskset (the program is multi-threaded, just in case
> >>you wonder), I get this:
> >>
> >>CPU 0+2:  0.0-60.0 sec  4.65 GBytes    665 Mbits/sec
> >>CPU 1+3:  0.0-60.0 sec  6.43 GBytes    920 Mbits/sec
> >>
> >>That's quite a difference, isn't it?
> >>
> >>Now I wonder what CPU 0 is doing...
> > 
> > 
> > Where does /proc/interrupts say the irqs are going?
> 
> Oh well...

>  27:   48463995          0          0          0   PCI-MSI-edge      eth0

What does it look like if you move the iperf around the CPUs while using
pci=nomsi? 

I'm looking to make sure I didn't cause a terrible regression in the
cost of IRQ handling...
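Something like this, run once with pci=nomsi and once without, would give a
quick per-CPU comparison (the server address is just an example):

	# one 60-second run pinned to each virtual CPU; keep iperf's summary line
	for cpu in 0 1 2 3; do
		echo -n "CPU $cpu: "
		taskset -c $cpu iperf -c 192.168.178.204 -t60 | tail -1
	done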


Patch

Index: linux-2.6.29/drivers/net/r8169.c
===================================================================
--- linux-2.6.29.orig/drivers/net/r8169.c	2009-05-23 11:06:22.000000000 +0200
+++ linux-2.6.29/drivers/net/r8169.c	2009-05-23 11:08:17.000000000 +0200
@@ -3554,54 +3554,64 @@ 
 	int handled = 0;
 	int status;
 
+	/* Loop handling interrupts until we have no new ones or
+	 * we hit an invalid/hotplug case.
+	 */
 	status = RTL_R16(IntrStatus);
+	while (status && status != 0xffff) {
+		handled = 1;
 
-	/* hotplug/major error/no more work/shared irq */
-	if ((status == 0xffff) || !status)
-		goto out;
-
-	handled = 1;
+		/* Handle all of the error cases first. These will reset
+		 * the chip, so just exit the loop.
+		 */
+		if (unlikely(!netif_running(dev))) {
+			rtl8169_asic_down(ioaddr);
+			break;
+		}
 
-	if (unlikely(!netif_running(dev))) {
-		rtl8169_asic_down(ioaddr);
-		goto out;
-	}
+		/* Work around for rx fifo overflow */
+		if (unlikely(status & RxFIFOOver) &&
+		    (tp->mac_version == RTL_GIGA_MAC_VER_11)) {
+			netif_stop_queue(dev);
+			rtl8169_tx_timeout(dev);
+			break;
+		}
 
-	status &= tp->intr_mask;
-	RTL_W16(IntrStatus,
-		(status & RxFIFOOver) ? (status | RxOverflow) : status);
+		if (unlikely(status & SYSErr)) {
+			rtl8169_pcierr_interrupt(dev);
+			break;
+		}
 
-	if (!(status & tp->intr_event))
-		goto out;
+		if (status & LinkChg)
+			rtl8169_check_link_status(dev, tp, ioaddr);
 
-	/* Work around for rx fifo overflow */
-	if (unlikely(status & RxFIFOOver) &&
-	    (tp->mac_version == RTL_GIGA_MAC_VER_11)) {
-		netif_stop_queue(dev);
-		rtl8169_tx_timeout(dev);
-		goto out;
-	}
+		/* We need to see the latest version of tp->intr_mask to
+		 * avoid ignoring an MSI interrupt and having to wait for
+		 * another event which may never come.
+		 */
+		smp_rmb();
+		if (status & tp->intr_mask & tp->napi_event) {
+			RTL_W16(IntrMask, tp->intr_event & ~tp->napi_event);
+			tp->intr_mask = ~tp->napi_event;
+
+			if (likely(netif_rx_schedule_prep(&tp->napi)))
+				__netif_rx_schedule(&tp->napi);
+			else if (netif_msg_intr(tp)) {
+				printk(KERN_INFO "%s: interrupt %04x in poll\n",
+				       dev->name, status);
+			}
+		}
 
-	if (unlikely(status & SYSErr)) {
-		rtl8169_pcierr_interrupt(dev);
-		goto out;
+		/* We only get a new MSI interrupt when all active irq
+		 * sources on the chip have been acknowledged. So, ack
+		 * everything we've seen and check if new sources have become
+		 * active to avoid blocking all interrupts from the chip.
+		 */
+		RTL_W16(IntrStatus,
+			(status & RxFIFOOver) ? (status | RxOverflow) : status);
+		status = RTL_R16(IntrStatus);
 	}
 
-	if (status & LinkChg)
-		rtl8169_check_link_status(dev, tp, ioaddr);
-
-	if (status & tp->napi_event) {
-		RTL_W16(IntrMask, tp->intr_event & ~tp->napi_event);
-		tp->intr_mask = ~tp->napi_event;
-
-		if (likely(netif_rx_schedule_prep(&tp->napi)))
-			__netif_rx_schedule(&tp->napi);
-		else if (netif_msg_intr(tp)) {
-			printk(KERN_INFO "%s: interrupt %04x in poll\n",
-			       dev->name, status);
-		}
-	}
-out:
 	return IRQ_RETVAL(handled);
 }
 
@@ -3617,13 +3627,15 @@ 
 
 	if (work_done < budget) {
 		netif_rx_complete(napi);
-		tp->intr_mask = 0xffff;
-		/*
-		 * 20040426: the barrier is not strictly required but the
-		 * behavior of the irq handler could be less predictable
-		 * without it. Btw, the lack of flush for the posted pci
-		 * write is safe - FR
+
+		/* We need to force the visibility of tp->intr_mask
+		 * for other CPUs, as we can lose an MSI interrupt
+		 * and potentially wait for a retransmit timeout if we don't.
+		 * The posted write to IntrMask is safe, as it will
+		 * eventually make it to the chip and we won't lose anything
+		 * until it does.
 		 */
+		tp->intr_mask = 0xffff;
 		smp_wmb();
 		RTL_W16(IntrMask, tp->intr_event);
 	}