
[Bug,16142] r8169: Kernel Panic when a lot of data is transferred through network interface

Message ID 20100623080200.GB5010@liondog.tnic
State Not Applicable
Delegated to: David Miller

Commit Message

Borislav Petkov June 23, 2010, 8:02 a.m. UTC
From: Hans Mueller <hans42mueller@googlemail.com>
Date: Sat, Jun 19, 2010 at 01:27:08PM +0200

Right, ok, so this one is starting to get reeeal nasty...

Thanks for the logs.

> The hang seems to occur only if I use my test case (copying the file via
> scp (log11)). If I just boot the system and then shut it down again, the
> hang does not happen (log10).

Yeah, about that: can you boot your machine and do

cat /proc/interrupts > irqs

and

dmesg > dmesg.log

and send me the two files. It could be that the ide layer doesn't see
any interrupts anymore...
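
One way to compare two such snapshots is to total the per-CPU counters for a
given line and see whether the total still advances. This is only a rough
userspace sketch, assuming the usual /proc/interrupts layout (IRQ number,
colon, one counter per CPU, then chip type and handler names); the helper
name irq_total() is made up for illustration.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/*
 * Sum the per-CPU counters on one /proc/interrupts line.
 * Returns -1 if the line has no "N:" prefix (e.g. the CPU header line).
 * Hypothetical helper, not part of any kernel or libc API.
 */
long irq_total(const char *line)
{
	const char *p;
	long total = 0;

	assert(line != NULL);
	p = strchr(line, ':');
	if (!p)
		return -1;
	p++;

	for (;;) {
		char *end;

		while (*p == ' ' || *p == '\t')
			p++;
		if (!isdigit((unsigned char)*p))
			break;		/* reached the chip type or end of line */
		total += strtol(p, &end, 10);
		p = end;
	}
	return total;
}
```

If the total for the ide line is the same before and after the test case,
that line is not receiving interrupts anymore.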

Also, when you shut down after having done your test case, do you see any
activity after the "task ... blocked" backtrace? IOW, your log11 shows
that the device is processing some more requests but it could be because
of that other bug you said the capturing program had and maybe because
the kernel log buffer is not empty yet...

Here's the next debugging patch :), this one enables ide-cd full debug,
please apply and catch the output when shutting down.

Thanks for your hard work!

---

Comments

Hans Mueller June 25, 2010, 4:58 p.m. UTC | #1
Okay, I hope I captured all the data you asked me for.


On Wed, 23 Jun 2010 10:02:00 +0200
Borislav Petkov <bp@alien8.de> wrote:
> cat /proc/interrupts > irqs
> dmesg > dmesg.log

They are attached.


> Also, when you shut down after having done your test case, do you see any
> activity after the "task ... blocked" backtrace?

log12 contains everything exactly as I saw it on my screen; the data I
saw also ended with "[...] comp".

log12 contains the kernel buffer's data from the time between starting
the transfer and the hang while shutting down.
Borislav Petkov June 30, 2010, 6:54 a.m. UTC | #2
From: Hans Mueller <hans42mueller@googlemail.com>
Date: Fri, Jun 25, 2010 at 06:58:46PM +0200

Hi, sorry for the delay.

Right, and I had a suspicion about sharing IRQs with the NIC:

            CPU0       CPU1       CPU2       CPU3       CPU4       CPU5       CPU6       CPU7
   0:      42503          0          0          0          0          0          0          0   IO-APIC-edge      timer
   1:          2          0          0          0          0          0          0          0   IO-APIC-edge      i8042
   4:          2          0          0          0          0          0          0          0   IO-APIC-edge
   9:          0          0          0          0          0          0          0          0   IO-APIC-fasteoi   acpi
  12:          4          0          0          0          0          0          0          0   IO-APIC-edge      i8042
  14:          0          0          0          0          0          0          0          0   IO-APIC-edge      ide2
  15:          0          0          0          0          0          0          0          0   IO-APIC-edge      ide3
  16:         26          0          0          0          0          0          0          0   IO-APIC-fasteoi   ehci_hcd:usb1
  17:        127          0          0          0          0          0          0          0   IO-APIC-fasteoi   hda_intel
  18:         47          0          0          0          0          0          0          0   IO-APIC-fasteoi   firewire_ohci, ahci
  19:        753          0          0          0          0          0          0          0   IO-APIC-fasteoi   ide0, ide1, eth0
  21:       3989          0          0          0          0          0          0          0   IO-APIC-fasteoi   ahci

so IRQ19 is shared between the NIC and the first IDE controller, while the
SATA controller is using another IRQ line, which could explain why the
issue doesn't happen with libata. Is your NIC a pluggable card and, if
yes, can you move it to another PCI slot so that ide0 and ide1 don't
share the same IRQ line with eth0, and then retest? Before retesting,
though, do 'cat /proc/interrupts' to make sure.

I'm guessing the problem will go away then...
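
To do the 'cat /proc/interrupts' check mechanically, something like the
following could count how many handlers sit on one line. It is a sketch
only, assuming the 2010-era layout shown in the table above (counters, one
chip-type field such as IO-APIC-fasteoi, then a comma-separated handler
list); irq_sharers() is a made-up name.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/*
 * Count the handler names on one /proc/interrupts line, e.g.
 * "19: 753 0 IO-APIC-fasteoi   ide0, ide1, eth0" gives 3.
 * Assumes exactly one chip-type field after the counters and no
 * commas inside handler names. Returns -1 without an "N:" prefix.
 * Hypothetical helper, not a kernel or libc API.
 */
int irq_sharers(const char *line)
{
	const char *p;
	int n = 0;

	assert(line != NULL);
	p = strchr(line, ':');
	if (!p)
		return -1;
	p++;

	/* skip the per-CPU counters */
	for (;;) {
		p += strspn(p, " \t");
		if (!isdigit((unsigned char)*p))
			break;
		p += strspn(p, "0123456789");
	}

	/* skip the chip type, e.g. "IO-APIC-fasteoi" */
	p += strcspn(p, " \t\n");

	/* every remaining comma-separated name is one handler */
	for (;;) {
		p += strspn(p, " \t\n,");
		if (*p == '\0')
			break;
		n++;
		p += strcspn(p, ",\n");
	}
	return n;
}
```

A result greater than 1 for the line that carries eth0 means the NIC still
shares its IRQ with something else.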

Thanks.
Hans Mueller June 30, 2010, 6:02 p.m. UTC | #3
Hi,

On Wed, 30 Jun 2010 08:54:26 +0200
Borislav Petkov <bp@alien8.de> wrote:

> Is your NIC a pluggable card [...]
No, sorry, it's an onboard card.
Furthermore, I don't know how to change the IRQ using Linux. (Or rather
whether it's possible at all.)
I didn't find an option in my board's BIOS to change the IRQ
mappings, either.
Borislav Petkov June 30, 2010, 6:31 p.m. UTC | #4
From: Hans Mueller <hans42mueller@googlemail.com>
Date: Wed, Jun 30, 2010 at 08:02:54PM +0200

> On Wed, 30 Jun 2010 08:54:26 +0200
> Borislav Petkov <bp@alien8.de> wrote:
> 
> > Is your NIC a pluggable card [...]
> No, sorry, it's an onboard card.
> Furthermore, I don't know how to change the IRQ using Linux. (Or rather
> whether it's possible at all.)
> I didn't find an option in my board's BIOS to change the IRQ
> mappings, either.

Ok, first you can try something which is real easy: I see you have
ide2 and ide3 channels, each having its own IRQ line. You could move
the cdrom connector to the other IDE controller and test again.

Alternatively, if you have a spare PCI NIC, you can insert it into one
of the PCI slots after having disabled the onboard NIC in the BIOS. Just
for testing purposes, to see whether "unsharing" the IRQ line fixes the
issue.

Thanks.
Hans Mueller July 3, 2010, 7:49 a.m. UTC | #5
Hi, 

On Wed, 30 Jun 2010 20:31:38 +0200
Borislav Petkov <bp@alien8.de> wrote:
> Alternatively, if you have a spare PCI NIC, you can insert it into one
> of the PCI slots after having disabled the onboard NIC in the BIOS. Just
> for testing purposes, to see whether "unsharing" the IRQ line fixes the
> issue.

I currently have no access to the computer; I will be able to test
things again on Monday.
But as I wrote in my original bug report, I tested with a PCI NIC. (But
the onboard NIC was not completely disabled, it was only disabled via
ifconfig ... down)
When using the PCI NIC, the whole problem (kernel panic) did not occur.


--
Regards Jonas
--
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Borislav Petkov July 3, 2010, 8:23 a.m. UTC | #6
From: Hans Mueller <hans42mueller@googlemail.com>
Date: Sat, Jul 03, 2010 at 09:49:18AM +0200

> Hi, 
> 
> On Wed, 30 Jun 2010 20:31:38 +0200
> Borislav Petkov <bp@alien8.de> wrote:
> > Alternatively, if you have a spare PCI NIC, you can insert it into one
> > of the PCI slots after having disabled the onboard NIC in the BIOS. Just
> > for testing purposes, to see whether "unsharing" the IRQ line fixes the
> > issue.
> 
> I currently have no access to the computer; I will be able to test
> things again on Monday.
> But as I wrote in my original bug report, I tested with a PCI NIC. (But
> the onboard NIC was not completely disabled, it was only disabled via
> ifconfig ... down)
> When using the PCI NIC, the whole problem (kernel panic) did not occur.

Ok, this confirms my suspicion that it is shared-IRQ related. Also, we
already verified that switching to libata does fix the issue for you, so
you are good to go. Considering the DEPRECATED status of ide, I have
very little incentive to hunt this thing down any further, so let's leave
it at that.
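
For background on why a shared line is a plausible suspect: on Linux, every
handler registered on a shared IRQ line is called for each interrupt, and
each is expected to check its own device and return IRQ_NONE if that device
did not raise it. The exact failure mode here was not tracked down, so the
following is only a userspace model of that contract with made-up names
(struct model_dev, model_deliver()), not the ide or r8169 code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum model_irq_ret { MODEL_IRQ_NONE, MODEL_IRQ_HANDLED };

struct model_dev {
	const char *name;
	bool pending;		/* device has raised its interrupt */
	unsigned int handled;	/* interrupts this handler claimed */
};

/* Handler for one device: claim the interrupt only if it is ours. */
static enum model_irq_ret model_handler(struct model_dev *dev)
{
	if (!dev->pending)
		return MODEL_IRQ_NONE;
	dev->pending = false;
	dev->handled++;
	return MODEL_IRQ_HANDLED;
}

/*
 * Deliver one interrupt on a shared line by calling every handler.
 * Returns how many handlers claimed it; 0 models a spurious interrupt.
 */
int model_deliver(struct model_dev *devs, size_t n)
{
	int claimed = 0;
	size_t i;

	assert(devs != NULL);
	for (i = 0; i < n; i++)
		if (model_handler(&devs[i]) == MODEL_IRQ_HANDLED)
			claimed++;
	return claimed;
}
```

With ide0, ide1 and eth0 on one line, heavy network traffic means the ide
handlers get invoked for every NIC interrupt, which is the situation that
goes away once the line is no longer shared.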

I'll send the first fix to Dave since it is still needed and add a note
to bugzilla for further reference.

Jonas, big thanks for your hard work with testing patches and ideas. I
really appreciate it! :)
Hans Mueller July 3, 2010, 11:41 a.m. UTC | #7
Hi

On Sat, 3 Jul 2010 10:23:04 +0200
Borislav Petkov <bp@alien8.de> wrote:
> Ok, this confirms my suspicion that it is shared-irq related. Also, we
> already verified that switching to libata does fix the issue for you so
> you are good to go. Considering the DEPRECATED status of ide, I have a
> very little incentive in hunting this thing further down, so let's leave
> it at that.

Okay, good. :)

 
> Jonas, big thanks for your hard work with testing patches and ideas. I
> really appreciate it! :)
You're welcome. :)

Big thanks to all of you who tried to resolve the bug, or helped in
another way.
I emphasize this especially, as it seems not to be standard to answer
bug reports at all. (At least on the b43 bug report list; I reported a
bug as their driver seems to have destroyed my wifi card. Don't
misunderstand me, I know this can happen, but I expected the will to
stop the driver from destroying other people's hardware.)


--
Regards/Gruss Jonas
Borislav Petkov July 3, 2010, 11:53 a.m. UTC | #8
From: Hans Mueller <hans42mueller@googlemail.com>
Date: Sat, Jul 03, 2010 at 01:41:49PM +0200

> I emphasize this especially, as it seems not to be standard to answer
> bug reports at all.

It's a shame that users get that impression, but sadly I must admit I
know what you mean and yes, we should try harder instead of hacking in
a gazillion new features. But the answer is simple: writing new features
is much more fun than bug hunting...

> (At least on the b43 bug report list; I reported a bug as their driver
> seems to have destroyed my wifi card. Don't misunderstand me, I know
> this can happen, but I expected the will to stop the driver from
> destroying other people's hardware.)

Hmm, that's strange. I remember reading that while
there's no real maintainer of the driver, there are
still people doing some work on it, according to this:
http://marc.info/?l=linux-wireless&m=127747220616577&w=2

Did you also add the wireless maintainer to the Cc of your bugreport
- "John W. Linville" <linville@tuxdriver.com> - along with a detailed
description of what the problem is, which kernel, how to reproduce along
with dmesg?
Hans Mueller July 5, 2010, 11:44 a.m. UTC | #9
Hi, 

even if this is quite off-topic and I'm not sure whether linux-ide
should still be in the Cc field, I am replying to all so as to leave
nobody with half of the story :)

On Sat, 3 Jul 2010 13:53:05 +0200
Borislav Petkov <bp@alien8.de> wrote:

> Did you also add the wireless maintainer to the Cc of your bugreport
> - "John W. Linville" <linville@tuxdriver.com>  [...]

No, that is the only thing I did not do. I followed the instructions from:
http://linuxwireless.org/en/users/Drivers/b43#bug_reporting
It wasn't mentioned to Cc anybody as far as I know :)



> [...] along with a detailed
> description of what the problem is, which kernel, how to reproduce along
> with dmesg?

I attached the text of the original mail (but not the attachments as
there should be no need for them in here; if I am wrong with this, ask
me for them :) )
If there is anything wrong with the bugreport, feel free to criticize.


I am aware that this is off-topic, so do not spend too much time
on it. I am going to send a copy of the original mail to the wireless
maintainer, as you suggested, and see what will happen.


Gruss / Regards,
Jonas
Borislav Petkov July 5, 2010, 12:11 p.m. UTC | #10
From: Hans Mueller <hans42mueller@googlemail.com>
Date: Mon, Jul 05, 2010 at 01:44:01PM +0200

> even if this is quite off-topic and I'm not sure whether linux-ide
> should still be in the Cc field, I am replying to all so as to leave
> nobody with half of the story :)

Right.

> > Did you also add the wireless maintainer to the Cc of your bugreport
> > - "John W. Linville" <linville@tuxdriver.com>  [...]
> 
> No, that is the only thing I did not do. I followed the instructions from:
> http://linuxwireless.org/en/users/Drivers/b43#bug_reporting
> It wasn't mentioned to Cc anybody as far as I know :)
> 
> 
> 
> > [...] along with a detailed
> > description of what the problem is, which kernel, how to reproduce along
> > with dmesg?
> 
> I attached the text of the original mail (but not the attachments as
> there should be no need for them in here; if I am wrong with this, ask
> me for them :) )
> If there is anything wrong with the bugreport, feel free to criticize.

Yep, it looks good. Bottom line is: the bug report should try to
plausibly lay out what the symptoms are and how to reproduce them, if
possible. Better to be too verbose than to leave out something which
might turn out to be important.

Btw, does MacOS recognize your wlan card at all or is it completely
bricked? And does it freeze only after you reboot from Linux?

> I am aware that this is off-topic, so do not spend too much time
> on it. I am going to send a copy of the original mail to the wireless
> maintainer, as you suggested, and see what will happen.

Yes, good luck :)

Patch

diff --git a/block/blk-core.c b/block/blk-core.c
index 9fe174d..1213e13 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -173,9 +173,9 @@  void blk_dump_rq_flags(struct request *rq, char *msg)
 {
 	int bit;
 
-	printk(KERN_INFO "%s: dev %s: type=%x, flags=%x\n", msg,
+	printk(KERN_INFO "%s: dev %s: type=%x, flags=%x, ref_count: %d\n", msg,
 		rq->rq_disk ? rq->rq_disk->disk_name : "?", rq->cmd_type,
-		rq->cmd_flags);
+		rq->cmd_flags, rq->ref_count);
 
 	printk(KERN_INFO "  sector %llu, nr/cnr %u/%u\n",
 	       (unsigned long long)blk_rq_pos(rq),
diff --git a/drivers/ide/ide-cd.c b/drivers/ide/ide-cd.c
index 64207df..cefcaf4 100644
--- a/drivers/ide/ide-cd.c
+++ b/drivers/ide/ide-cd.c
@@ -448,6 +448,7 @@  int ide_cd_queue_pc(ide_drive_t *drive, const unsigned char *cmd,
 		int error;
 
 		rq = blk_get_request(drive->queue, write, __GFP_WAIT);
+		blk_dump_rq_flags(rq, "ide_cd_queue_pc got rq");
 
 		memcpy(rq->cmd, cmd, BLK_MAX_CDB);
 		rq->cmd_type = REQ_TYPE_ATA_PC;
@@ -464,12 +465,14 @@  int ide_cd_queue_pc(ide_drive_t *drive, const unsigned char *cmd,
 		}
 
 		error = blk_execute_rq(drive->queue, info->disk, rq, 0);
+		blk_dump_rq_flags(rq, "ide_cd_queue_pc exec rq");
 
 		if (buffer)
 			*bufflen = rq->resid_len;
 
 		flags = rq->cmd_flags;
 		blk_put_request(rq);
+		blk_dump_rq_flags(rq, "ide_cd_queue_pc put rq");
 
 		/*
 		 * FIXME: we should probably abort/retry or something in case of
@@ -506,15 +509,23 @@  int ide_cd_queue_pc(ide_drive_t *drive, const unsigned char *cmd,
 	return (flags & REQ_FAILED) ? -EIO : 0;
 }
 
-static void ide_cd_error_cmd(ide_drive_t *drive, struct ide_cmd *cmd)
+/*
+ * notify callers that we ended the rq by returning a true value
+ */
+static bool ide_cd_error_cmd(ide_drive_t *drive, struct ide_cmd *cmd)
 {
 	unsigned int nr_bytes = cmd->nbytes - cmd->nleft;
 
 	if (cmd->tf_flags & IDE_TFLAG_WRITE)
 		nr_bytes -= cmd->last_xfer_len;
 
-	if (nr_bytes > 0)
+	if (nr_bytes > 0) {
+		blk_dump_rq_flags(drive->hwif->rq, "ide_cd_error_cmd completes rq");
 		ide_complete_rq(drive, 0, nr_bytes);
+		return true;
+	}
+
+	return false;
 }
 
 static ide_startstop_t cdrom_newpc_intr(ide_drive_t *drive)
@@ -552,8 +563,10 @@  static ide_startstop_t cdrom_newpc_intr(ide_drive_t *drive)
 	if (!OK_STAT(stat, 0, BAD_R_STAT)) {
 		rc = cdrom_decode_status(drive, stat);
 		if (rc) {
-			if (rc == 2)
+			if (rc == 2) {
+				printk(KERN_EMERG "%s: bad status with a sense rq: %p\n", __func__, rq);
 				goto out_end;
+			}
 			return ide_stopped;
 		}
 	}
@@ -667,8 +680,10 @@  out_end:
 		blk_end_request_all(rq, 0);
 		hwif->rq = NULL;
 	} else {
-		if (sense && uptodate)
+		if (sense && uptodate) {
+			printk(KERN_EMERG "%s: complete failed rq: %p\n", __func__, rq);
 			ide_cd_complete_failed_rq(drive, rq);
+		}
 
 		if (blk_fs_request(rq)) {
 			if (cmd->nleft == 0)
@@ -679,7 +694,10 @@  out_end:
 		}
 
 		if (uptodate == 0 && rq->bio)
-			ide_cd_error_cmd(drive, cmd);
+			if (ide_cd_error_cmd(drive, cmd)) {
+				printk(KERN_EMERG "ide_cd_error_cmd completes rq\n");
+				return ide_stopped;
+			}
 
 		/* make sure it's fully ended */
 		if (blk_fs_request(rq) == 0) {
@@ -688,10 +706,13 @@  out_end:
 				rq->resid_len += cmd->last_xfer_len;
 		}
 
+		printk(KERN_EMERG "%s: completing rq %p\n", __func__, rq);
 		ide_complete_rq(drive, uptodate ? 0 : -EIO, blk_rq_bytes(rq));
 
-		if (sense && rc == 2)
+		if (sense && rc == 2) {
+			printk(KERN_EMERG "%s: request sense failure, rq: %p\n", __func__, rq);
 			ide_error(drive, "request sense failure", stat);
+		}
 	}
 	return ide_stopped;
 }
@@ -1707,6 +1728,8 @@  static int ide_cd_probe(ide_drive_t *drive)
 	struct gendisk *g;
 	struct request_sense sense;
 
+	drive->debug_mask = 0xffffffff;
+
 	ide_debug_log(IDE_DBG_PROBE, "driver_req: %s, media: 0x%x",
 				     drive->driver_req, drive->media);
 
@@ -1716,7 +1739,6 @@  static int ide_cd_probe(ide_drive_t *drive)
 	if (drive->media != ide_cdrom && drive->media != ide_optical)
 		goto failed;
 
-	drive->debug_mask = debug_mask;
 	drive->irq_handler = cdrom_newpc_intr;
 
 	info = kzalloc(sizeof(struct cdrom_info), GFP_KERNEL);
diff --git a/drivers/ide/ide-cd.h b/drivers/ide/ide-cd.h
index 93a3cf1..613542a 100644
--- a/drivers/ide/ide-cd.h
+++ b/drivers/ide/ide-cd.h
@@ -8,7 +8,7 @@ 
 #include <linux/cdrom.h>
 #include <asm/byteorder.h>
 
-#define IDECD_DEBUG_LOG		0
+#define IDECD_DEBUG_LOG		1
 
 #if IDECD_DEBUG_LOG
 #define ide_debug_log(lvl, fmt, args...) __ide_debug_log(lvl, fmt, ## args)
diff --git a/drivers/ide/ide-io.c b/drivers/ide/ide-io.c
index 172ac92..c522435 100644
--- a/drivers/ide/ide-io.c
+++ b/drivers/ide/ide-io.c
@@ -126,8 +126,13 @@  int ide_complete_rq(ide_drive_t *drive, int error, unsigned int nr_bytes)
 		nr_bytes = blk_rq_sectors(rq) << 9;
 
 	rc = ide_end_rq(drive, rq, error, nr_bytes);
-	if (rc == 0)
+	if (rc == 0) {
+		printk(KERN_EMERG "ide_complete_rq: no buffers pending for this rq\n");
 		hwif->rq = NULL;
+	}
+	else
+		blk_dump_rq_flags(rq, "still buffers pending for this rq");
+
 
 	return rc;
 }