Patchwork sky2: receive dma mapping error handling

Submitter Jarek Poplawski
Date Jan. 31, 2010, 12:34 a.m.
Message ID <20100131003449.GA11935@del.dom.local>
Permalink /patch/44101/
State RFC
Delegated to: David Miller

Comments

Jarek Poplawski - Jan. 31, 2010, 12:34 a.m.
On Sat, Jan 30, 2010 at 11:31:48AM -0500, Michael Breuer wrote:
> On 01/28/2010 06:36 PM, Stephen Hemminger wrote:
> >Please try this patch (and only this patch), on 2.6.33-rc5[*];
> >none of the other patches that did not make it upstream because that
> >confuses things too much.
> >
> >The code that checks for DMA mapping errors on receive buffers would
> >not handle errors correctly.  I doubt you have these errors, but if you
> >did then it would explain the problems.  The code has to be a little
> >tricky and build mapping for new rx buffer before releasing old one,
> >that way if new mapping fails, the old one can be reused.
> >
> >If it works for you, I will resubmit with signed-off.
> >
> >-
> >
> Nope - tx crash again. This time the system stayed up (but hosed)
> for a few hours. When I tried to recover eth0 the system then
> crashed.
> 
> Brief summary of events (log extract below):
> 
> System start Jan 28 19:29
> Everything seemed good (load and all) until 17:13:11 the following
> day when I got rx errors:
> 
> Jan 29 17:13:11 mail kernel: sky2 eth0: rx error, status 0x6230010
> length 1518
> Jan 29 17:13:11 mail kernel: sky2 eth0: rx error, status 0x7f40010
> length 1518

These are length errors, but status shows more than 1518, e.g. 2036
here, unless I miss something. Please, don't use jumbo frames in your
network until we fully debug it for regular frames (Stephen admitted
sky2 jumbo might be broken).
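For reference, the arithmetic behind the "2036" figure can be sketched as follows. This is a minimal user-space sketch, not driver code: it assumes the frame length sits in bits 30..16 of the status word and that bit 4 is the "too long packet" flag. The macro and function names are made up for the illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout of the receive status word (illustrative names):
 * frame length in bits 30..16, error flags in the low bits. */
#define RX_LEN_SHIFT 16
#define RX_LEN_MASK  0x7fffu
#define RX_LONG_ERR  (1u << 4)	/* assumed "too long packet" flag */

/* Extract the frame length field from a status word. */
static unsigned int rx_status_len(uint32_t status)
{
	return (status >> RX_LEN_SHIFT) & RX_LEN_MASK;
}

/* Non-zero if the assumed "too long" flag is set. */
static int rx_status_too_long(uint32_t status)
{
	return (status & RX_LONG_ERR) != 0;
}

/* Self-check against the two status values quoted in the log. */
static void check_log_values(void)
{
	assert(rx_status_len(0x6230010) == 1571);	/* 0x623 */
	assert(rx_status_len(0x7f40010) == 2036);	/* 0x7f4 */
	assert(rx_status_too_long(0x6230010));
	assert(rx_status_too_long(0x7f40010));
}
```

Under these assumptions both logged status words decode to a length above 1518 (1571 and 2036), with the same low flag bit set in each.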

...
> As I started looking at logs, the system hung and rebooted. I'm up
> now with dma debug enabled, however as with 2.6.32.4 num_entries is
> dropping and I don't think that dma debug will remain enabled long
> enough to catch a crash.

Could you try the patch below to show maybe some other users of
dma-debug entries?

Jarek P.
---

 lib/dma-debug.c |   52 +++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 files changed, 51 insertions(+), 1 deletions(-)

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Michael Breuer - Jan. 31, 2010, 4:17 a.m.
On 01/30/2010 07:34 PM, Jarek Poplawski wrote:
> On Sat, Jan 30, 2010 at 11:31:48AM -0500, Michael Breuer wrote:
>    
>> On 01/28/2010 06:36 PM, Stephen Hemminger wrote:
>>      
>>> Please try this patch (and only this patch), on 2.6.33-rc5[*];
>>> none of the other patches that did not make it upstream because that
>>> confuses things too much.
>>>
>>> The code that checks for DMA mapping errors on receive buffers would
>>> not handle errors correctly.  I doubt you have these errors, but if you
>>> did then it would explain the problems.  The code has to be a little
>>> tricky and build mapping for new rx buffer before releasing old one,
>>> that way if new mapping fails, the old one can be reused.
>>>
>>> If it works for you, I will resubmit with signed-off.
>>>
>>> -
>>>
>>>        
>> Nope - tx crash again. This time the system stayed up (but hosed)
>> for a few hours. When I tried to recover eth0 the system then
>> crashed.
>>
>> Brief summary of events (log extract below):
>>
>> System start Jan 28 19:29
>> Everything seemed good (load and all) until 17:13:11 the following
>> day when I got rx errors:
>>
>> Jan 29 17:13:11 mail kernel: sky2 eth0: rx error, status 0x6230010
>> length 1518
>> Jan 29 17:13:11 mail kernel: sky2 eth0: rx error, status 0x7f40010
>> length 1518
>>      
> These are length errors, but status shows more than 1518, e.g. 2036
> here, unless I miss something. Please, don't use jumbo frames in your
> network until we fully debug it for regular frames (Stephen admitted
> sky2 jumbo might be broken).
>    
MTU was 1500 - not using jumbo frames as they don't work.
> ...
>    
>> As I started looking at logs, the system hung and rebooted. I'm up
>> now with dma debug enabled, however as with 2.6.32.4 num_entries is
>> dropping and I don't think that dma debug will remain enabled long
>> enough to catch a crash.
>>      
> Could you try the patch below to show maybe some other users of
> dma-debug entries?
>
> Jarek P.
> ---
>    
Will do. Note that I'm running with the dma debug filter set to sky2.
Michael Breuer - Jan. 31, 2010, 4:55 a.m.
On 01/30/2010 07:34 PM, Jarek Poplawski wrote:
>
> Could you try the patch below to show maybe some other users of
> dma-debug entries?
>
> Jarek P.
> ---
>
>    
With the default # entries & dma_debug_driver=sky2:

6:00 is eth0 & 4:00 is eth1.

Jan 30 23:53:14 mail kernel: DMA-API: 0000:06:00.0: entries: 31961
Jan 30 23:53:14 mail kernel: DMA-API: 0000:00:1b.0: entries: 15
Jan 30 23:53:14 mail kernel: DMA-API: 0000:04:00.0: entries: 744
Jan 30 23:53:14 mail kernel: DMA-API: 0000:00:1f.2: entries: 6
Jan 30 23:53:14 mail kernel: DMA-API: 0000:08:01.0: entries: 3
Jan 30 23:53:14 mail kernel: DMA-API: 0000:00:1d.1: entries: 6
Jan 30 23:53:14 mail kernel: DMA-API: 0000:08:02.0: entries: 8
Jan 30 23:53:14 mail kernel: DMA-API: 0000:00:1a.7: entries: 3
Jan 30 23:53:14 mail kernel: DMA-API: 0000:00:1d.7: entries: 3
Jan 30 23:53:14 mail kernel: DMA-API: 0000:00:1a.0: entries: 3
Jan 30 23:53:14 mail kernel: DMA-API: 0000:00:1a.1: entries: 3
Jan 30 23:53:14 mail kernel: DMA-API: 0000:00:1a.2: entries: 3
Jan 30 23:53:14 mail kernel: DMA-API: 0000:00:1d.0: entries: 3
Jan 30 23:53:14 mail kernel: DMA-API: 0000:00:1d.2: entries: 3
Jan 30 23:53:14 mail kernel: DMA-API: 0000:02:00.0: entries: 1
Jan 30 23:53:14 mail kernel: DMA-API: 0000:05:00.0: entries: 2
Jan 30 23:53:14 mail kernel: DMA-API: debugging out of memory - disabling
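To put the dump above in proportion, the per-device counts can be summed. The sketch below is plain user-space C with the sixteen values copied from the log lines, in log order; the helper names are invented for the illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Entry counts copied from the DMA-API dump above, in log order. */
static const unsigned int dump_counts[] = {
	31961, 15, 744, 6, 3, 6, 8, 3, 3, 3, 3, 3, 3, 3, 1, 2
};
#define N_DEVS (sizeof(dump_counts) / sizeof(dump_counts[0]))

/* Sum all per-device entry counts. */
static unsigned int total_entries(const unsigned int *counts, size_t n)
{
	unsigned int sum = 0;

	for (size_t i = 0; i < n; i++)
		sum += counts[i];
	return sum;
}

/* Integer percentage (rounded down) of part relative to total. */
static unsigned int share_percent(unsigned int part, unsigned int total)
{
	assert(total != 0);
	return (unsigned int)((unsigned long long)part * 100 / total);
}
```

The counts add up to 32767, and 0000:06:00.0 (eth0) accounts for 31961 of them, roughly 97%. The total being one short of 32768 would fit a pool of that size with the entry being allocated not yet hashed, but that reading is an inference from the numbers, not something the log states.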

Michael Breuer - Jan. 31, 2010, 6:50 p.m.
On 1/30/2010 11:55 PM, Michael Breuer wrote:
> On 01/30/2010 07:34 PM, Jarek Poplawski wrote:
>>
>> Could you try the patch below to show maybe some other users of
>> dma-debug entries?
>>
>> Jarek P.
>> ---
>>
> With the default # entries & dma_debug_driver=sky2:
>
> 6:00 is eth0 & 4:00 is eth1.
>
> Jan 30 23:53:14 mail kernel: DMA-API: 0000:06:00.0: entries: 31961
> ...
>
I put a printk as a third else case in sky2_tx_unmap. Looks like the
issue is that a large number of (perhaps all) calls to sky2_tx_unmap have
re->flags set to neither TX_MAP_SINGLE nor TX_MAP_PAGE. Thus the elements
are never being unmapped.

I suspect that the system collapses when using DMAR sooner than if not 
using DMAR. Probably some hardware limitation on the number of mapped 
elements that is less than the software limitation. I don't see at 
present how a ring element can ever get to this code without re->flags 
being set to one or the other.
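The failure mode described here can be modelled in a few lines. This is a user-space sketch under stated assumptions, not the driver's code: the flag values, the struct, and both function names are hypothetical, and "mapped" stands in for a live DMA mapping.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag values; only the names come from the thread. */
#define TX_MAP_SINGLE 0x0001u
#define TX_MAP_PAGE   0x0002u

/* Stand-in for a tx ring element: "mapped" models a live DMA mapping. */
struct ring_elem_model {
	unsigned int flags;
	int mapped;
};

/* Release the mapping according to the flags. Returns 1 if something
 * was unmapped, 0 if neither flag was set and nothing happened. */
static int tx_unmap_model(struct ring_elem_model *re)
{
	assert(re != NULL);

	if (re->flags & TX_MAP_SINGLE) {
		re->mapped = 0;		/* models unmapping a single buffer */
		return 1;
	} else if (re->flags & TX_MAP_PAGE) {
		re->mapped = 0;		/* models unmapping a page fragment */
		return 1;
	}
	return 0;			/* the "third else" case: mapping stays */
}

/* Run the unmap over a ring and count elements still mapped afterwards. */
static unsigned int count_leaked(struct ring_elem_model *ring, size_t n)
{
	unsigned int leaked = 0;

	for (size_t i = 0; i < n; i++) {
		tx_unmap_model(&ring[i]);
		if (ring[i].mapped)
			leaked++;
	}
	return leaked;
}
```

In this model, any element that reaches the unmap step with flags equal to zero keeps its mapping, so every such element adds one entry that is never released.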

Michael Breuer - Jan. 31, 2010, 9:58 p.m.
On 1/31/2010 1:50 PM, Michael Breuer wrote:
> On 1/30/2010 11:55 PM, Michael Breuer wrote:
>> On 01/30/2010 07:34 PM, Jarek Poplawski wrote:
>>>
>>> Could you try the patch below to show maybe some other users of
>>> dma-debug entries?
>>>
>>> Jarek P.
>>> ---
>>>
>> With the default # entries & dma_debug_driver=sky2:
>>
>> 6:00 is eth0 & 4:00 is eth1.
>>
>> Jan 30 23:53:14 mail kernel: DMA-API: 0000:06:00.0: entries: 31961
>> ...
>>
> I put a printk as a third else case in sky2_tx_unmap. Looks like the
> issue is that a large number of (perhaps all) calls to sky2_tx_unmap have
> re->flags set to neither TX_MAP_SINGLE nor TX_MAP_PAGE. Thus the
> elements are never being unmapped.
>
> I suspect that the system collapses when using DMAR sooner than if not 
> using DMAR. Probably some hardware limitation on the number of mapped 
> elements that is less than the software limitation. I don't see at 
> present how a ring element can ever get to this code without re->flags 
> being set to one or the other.
>
>
Put some more debugging code in... re->flags is always NULL upon entry 
to sky2_tx_unmap.

Jarek Poplawski - Jan. 31, 2010, 10:25 p.m.
On Sat, Jan 30, 2010 at 11:17:41PM -0500, Michael Breuer wrote:
> On 01/30/2010 07:34 PM, Jarek Poplawski wrote:
> >On Sat, Jan 30, 2010 at 11:31:48AM -0500, Michael Breuer wrote:
> >>Jan 29 17:13:11 mail kernel: sky2 eth0: rx error, status 0x6230010
> >>length 1518
> >>Jan 29 17:13:11 mail kernel: sky2 eth0: rx error, status 0x7f40010
> >>length 1518
> >These are length errors, but status shows more than 1518, e.g. 2036
> >here, unless I miss something. Please, don't use jumbo frames in your
> >network until we fully debug it for regular frames (Stephen admitted
> >sky2 jumbo might be broken).
> MTU was 1500 - not using jumbo frames as they don't work.

Do you mean no NIC in your network could have sent such frames?

Jarek P.
Michael Breuer - Jan. 31, 2010, 11:58 p.m.
On 1/31/2010 5:25 PM, Jarek Poplawski wrote:
> On Sat, Jan 30, 2010 at 11:17:41PM -0500, Michael Breuer wrote:
>    
>> On 01/30/2010 07:34 PM, Jarek Poplawski wrote:
>>      
>>> On Sat, Jan 30, 2010 at 11:31:48AM -0500, Michael Breuer wrote:
>>>        
>>>> Jan 29 17:13:11 mail kernel: sky2 eth0: rx error, status 0x6230010
>>>> length 1518
>>>> Jan 29 17:13:11 mail kernel: sky2 eth0: rx error, status 0x7f40010
>>>> length 1518
>>>>          
>>> These are length errors, but status shows more than 1518, e.g. 2036
>>> here, unless I miss something. Please, don't use jumbo frames in your
>>> network until we fully debug it for regular frames (Stephen admitted
>>> sky2 jumbo might be broken).
>>>        
>> MTU was 1500 - not using jumbo frames as they don't work.
>>      
> Do you mean no NIC in your network could have sent such frames?
>
> Jarek P.
>    
Well... There's only one possible source... and if there were it would 
have been a Win7 bug :) Regardless, sky2 shouldn't be sensitive to rogue 
external network stuff.

Patch

diff --git a/lib/dma-debug.c b/lib/dma-debug.c
index 7d2f0b3..e2dcc9c 100644
--- a/lib/dma-debug.c
+++ b/lib/dma-debug.c
@@ -310,6 +310,53 @@  static void hash_bucket_del(struct dma_debug_entry *entry)
 	list_del(&entry->list);
 }
 
+struct dma_debug_dev {
+	struct device *dev;
+	unsigned int cnt;
+};
+
+#define DMA_DEBUG_DEVS 100
+static struct dma_debug_dev dma_debug_devs[DMA_DEBUG_DEVS];
+
+static void debug_dma_dump_devs(void)
+{
+	int idx, i;
+
+	memset(dma_debug_devs, 0, sizeof(struct dma_debug_dev) * DMA_DEBUG_DEVS);
+
+	for (idx = 0; idx < HASH_SIZE; idx++) {
+		struct hash_bucket *bucket = &dma_entry_hash[idx];
+		struct dma_debug_entry *entry;
+		unsigned long flags;
+
+		spin_lock_irqsave(&bucket->lock, flags);
+
+		list_for_each_entry(entry, &bucket->list, list) {
+			for (i = 0; i < DMA_DEBUG_DEVS; i++) {
+				struct device *dev = dma_debug_devs[i].dev;
+
+				if (!dev || dev == entry->dev) {
+					dma_debug_devs[i].dev = entry->dev;
+					dma_debug_devs[i].cnt++;
+					break;
+				}
+			}
+		}
+
+		spin_unlock_irqrestore(&bucket->lock, flags);
+	}
+
+	for (i = 0; i < DMA_DEBUG_DEVS; i++) {
+		struct device *dev = dma_debug_devs[i].dev;
+
+		if (!dev)
+			break;
+
+		pr_info("DMA-API: %s: entries: %u\n", dev_name(dev),
+			dma_debug_devs[i].cnt);
+	}
+}
+
 /*
  * Dump mapping entries for debugging purposes
  */
@@ -363,8 +410,11 @@  static struct dma_debug_entry *__dma_entry_alloc(void)
 	memset(entry, 0, sizeof(*entry));
 
 	num_free_entries -= 1;
-	if (num_free_entries < min_free_entries)
+	if (num_free_entries < min_free_entries) {
 		min_free_entries = num_free_entries;
+		if ((min_free_entries & 0xffff) == 0)
+			debug_dma_dump_devs();
+	}
 
 	return entry;
 }
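
A note on when the dump in the second hunk fires: the check (min_free_entries & 0xffff) == 0 only passes when the new minimum is a multiple of 65536, zero included. The user-space sketch below replays that condition while a pool drains. The pool sizes, the function names, and the assumption that the minimum starts out equal to the pool size are illustrative, not taken from the kernel source.

```c
#include <assert.h>
#include <stddef.h>

/* Mirrors the patched condition: update the minimum, then report
 * whether the low 16 bits of the new minimum are all zero. */
static int dump_would_fire(unsigned int num_free, unsigned int *min_free)
{
	assert(min_free != NULL);

	if (num_free < *min_free) {
		*min_free = num_free;
		return (*min_free & 0xffff) == 0;
	}
	return 0;
}

/* Drain a pool of the given size one entry at a time and count how
 * often the dump would be triggered. Assumes min starts at pool size. */
static unsigned int count_dumps(unsigned int pool)
{
	unsigned int num_free = pool;
	unsigned int min_free = pool;
	unsigned int fired = 0;

	while (num_free > 0) {
		num_free -= 1;
		fired += dump_would_fire(num_free, &min_free);
	}
	return fired;
}
```

With a pool of 65536 entries or fewer, this model fires exactly once, at the moment the free count reaches zero. A pool of 131072 would fire twice, at 65536 and at zero.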