Patchwork sky2: receive dma mapping error handling

Submitter Jarek Poplawski
Date Jan. 31, 2010, 10:18 p.m.
Message ID <20100131221835.GA3317@del.dom.local>
Permalink /patch/44127/
State RFC
Delegated to: David Miller

Comments

Jarek Poplawski - Jan. 31, 2010, 10:18 p.m.
On Sun, Jan 31, 2010 at 04:58:42PM -0500, Michael Breuer wrote:
> On 1/31/2010 1:50 PM, Michael Breuer wrote:
> >On 1/30/2010 11:55 PM, Michael Breuer wrote:
> >>On 01/30/2010 07:34 PM, Jarek Poplawski wrote:
> >>>
> >>>Could you try the patch below to show maybe some other users of
> >>>dma-debug entries?
> >>>
> >>>Jarek P.
> >>>---
> >>>
> >>With the default # entries & dma_debug_driver=sky2:
> >>
> >>6:00 is eth0 & 4:00 is eth1.
> >>
> >>Jan 30 23:53:14 mail kernel: DMA-API: 0000:06:00.0: entries: 31961
> >>...
> >>
> >I put a printk as a third else case in sky2_tx_unmap. Looks like
> >the issue is that a large number (perhaps all) calls to
> >sky2_tx_unmap have re->flags set to neither TX_MAP_SINGLE nor
> >TX_MAP_PAGE. Thus the elements are never being unmapped.
> >
> >I suspect that the system collapses when using DMAR sooner than if
> >not using DMAR. Probably some hardware limitation on the number of
> >mapped elements that is less than the software limitation. I don't
> >see at present how a ring element can ever get to this code
> >without re->flags being set to one or the other.
> >
> >
> Put some more debugging code in... re->flags is always 0 upon
> entry to sky2_tx_unmap.
> 

Yes, good point! Could you check whether this patch fixes it? (Not compiled.)

Thanks,
Jarek P.
---

 drivers/net/sky2.c |   10 +++++++---
 1 files changed, 7 insertions(+), 3 deletions(-)
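
The failure mode reported in the quoted text (an unmap routine that dispatches on re->flags and finds neither flag set) can be modeled in user space. The following is only a minimal sketch under stated assumptions: the flag values, the ring size, the struct layout and every function name here are hypothetical stand-ins, not the driver's code. It shows that when the flags are wiped before unmap runs, no branch releases the mapping and every entry leaks.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag values; the real driver defines its own. */
#define TX_MAP_SINGLE 0x1u
#define TX_MAP_PAGE   0x2u
#define RING_SIZE     8      /* illustrative size only */

struct ring_entry {
	unsigned int flags;  /* how the buffer was mapped, 0 = not recorded */
	int mapped;          /* 1 while a (simulated) DMA mapping is live */
};

/* Simulated map: remember which kind of mapping was made. */
static void entry_map(struct ring_entry *re, unsigned int kind)
{
	assert(kind == TX_MAP_SINGLE || kind == TX_MAP_PAGE);
	re->flags = kind;
	re->mapped = 1;
}

/* Flag-dispatched unmap. Returns 1 if the mapping was released. */
static int entry_unmap(struct ring_entry *re)
{
	if (re->flags & TX_MAP_SINGLE) {
		re->mapped = 0;  /* stands in for a single-buffer unmap */
		return 1;
	} else if (re->flags & TX_MAP_PAGE) {
		re->mapped = 0;  /* stands in for a page unmap */
		return 1;
	}
	return 0;                /* the "third else": nothing is released */
}

/*
 * Map every entry, optionally wipe the flags (the condition reported
 * in the thread), then unmap everything. Returns the number of
 * mappings still live afterwards, i.e. the leak count.
 */
static size_t run_ring(int clobber_flags)
{
	struct ring_entry ring[RING_SIZE] = { { 0, 0 } };
	size_t i, live = 0;

	for (i = 0; i < RING_SIZE; i++) {
		entry_map(&ring[i], (i % 2) ? TX_MAP_PAGE : TX_MAP_SINGLE);
		if (clobber_flags)
			ring[i].flags = 0;
	}
	for (i = 0; i < RING_SIZE; i++)
		(void)entry_unmap(&ring[i]);
	for (i = 0; i < RING_SIZE; i++)
		live += (size_t)ring[i].mapped;
	return live;
}
```

In this model run_ring(0) leaves no live mappings, while run_ring(1) leaves all RING_SIZE of them live, which is the kind of steady growth a dma-debug entry counter would show.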

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Michael Breuer - Feb. 1, 2010, 12:19 a.m.
On 1/31/2010 5:18 PM, Jarek Poplawski wrote:
> On Sun, Jan 31, 2010 at 04:58:42PM -0500, Michael Breuer wrote:
>    
>> On 1/31/2010 1:50 PM, Michael Breuer wrote:
>>      
>>> On 1/30/2010 11:55 PM, Michael Breuer wrote:
>>>        
>>>> On 01/30/2010 07:34 PM, Jarek Poplawski wrote:
>>>>          
>>>>> Could you try the patch below to show maybe some other users of
>>>>> dma-debug entries?
>>>>>
>>>>> Jarek P.
>>>>> ---
>>>>>
>>>>>            
>>>> With the default # entries & dma_debug_driver=sky2:
>>>>
>>>> 6:00 is eth0 & 4:00 is eth1.
>>>>
>>>> Jan 30 23:53:14 mail kernel: DMA-API: 0000:06:00.0: entries: 31961
>>>> ...
>>>>
>>>>          
>>> I put a printk as a third else case in sky2_tx_unmap. Looks like
>>> the issue is that a large number (perhaps all) calls to
>>> sky2_tx_unmap have re->flags set to neither TX_MAP_SINGLE nor
>>> TX_MAP_PAGE. Thus the elements are never being unmapped.
>>>
>>> I suspect that the system collapses when using DMAR sooner than if
>>> not using DMAR. Probably some hardware limitation on the number of
>>> mapped elements that is less than the software limitation. I don't
>>> see at present how a ring element can ever get to this code
>>> without re->flags being set to one or the other.
>>>
>>>
>>>        
>> Put some more debugging code in... re->flags is always 0 upon
>> entry to sky2_tx_unmap.
>>
>>      
> Yes, good point! Could you check whether this patch fixes it? (Not compiled.)
>
> Thanks,
> Jarek P.
> ---
>
>   drivers/net/sky2.c |   10 +++++++---
>   1 files changed, 7 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/net/sky2.c b/drivers/net/sky2.c
> index d760650..3437917 100644
> --- a/drivers/net/sky2.c
> +++ b/drivers/net/sky2.c
> @@ -1025,9 +1025,10 @@ static void sky2_prefetch_init(struct sky2_hw *hw, u32 qaddr,
>   static inline struct sky2_tx_le *get_tx_le(struct sky2_port *sky2, u16 *slot)
>   {
>   	struct sky2_tx_le *le = sky2->tx_le + *slot;
> -	struct tx_ring_info *re = sky2->tx_ring + *slot;
> +	struct tx_ring_info *re;
>
>   	*slot = RING_NEXT(*slot, sky2->tx_ring_size);
> +	re = sky2->tx_ring + *slot;
>   	re->flags = 0;
>   	re->skb = NULL;
>   	le->ctrl = 0;
> @@ -1036,13 +1037,16 @@ static inline struct sky2_tx_le *get_tx_le(struct sky2_port *sky2, u16 *slot)
>
>   static void tx_init(struct sky2_port *sky2)
>   {
> -	struct sky2_tx_le *le;
> +	struct sky2_tx_le *le = sky2->tx_le;
> +	struct tx_ring_info *re = sky2->tx_ring;
>
>   	sky2->tx_prod = sky2->tx_cons = 0;
>   	sky2->tx_tcpsum = 0;
>   	sky2->tx_last_mss = 0;
>
> -	le = get_tx_le(sky2, &sky2->tx_prod);
> +	re->flags = 0;
> +	re->skb = NULL;
> +	le->ctrl = 0;
>   	le->addr = 0;
>   	le->opcode = OP_ADDR64 | HW_OWNER;
>   	sky2->tx_last_upper = 0;
>    
OK, this solves the dma-debug issue, i.e., elements are now being unmapped.

Will leave up and hit with traffic unless a crash occurs. If I hit 
something unrelated I'll backport to 2.6.32.7 and try that for a while. 
I do think it's plausible that the dma errors after (during) load were 
due to hardware limitations on the number of mapped entries (haven't 
researched what that limit was). I would also assume that the sw map 
would also have failed eventually.

I'd suggest that regardless of whether this patch solves my crash that 
it ought to be backported as it seems unlikely that any machine would be 
able to survive for long without the tx entries being unmapped.
Michael Breuer - Feb. 1, 2010, 4:26 a.m.
On 1/31/2010 7:19 PM, Michael Breuer wrote:
> On 1/31/2010 5:18 PM, Jarek Poplawski wrote:
> solves the dma-debug issue - i.e., elements are now being unmapped.
>
> Will leave up and hit with traffic unless a crash occurs. If I hit 
> something unrelated I'll backport to 2.6.32.7 and try that for a 
> while. I do think it's plausible that the dma errors after (during) 
> load were due to hardware limitations on the number of mapped entries 
> (haven't researched what that limit was). I would also assume that the 
> sw map would also have failed eventually.
>
> I'd suggest that regardless of whether this patch solves my crash that 
> it ought to be backported as it seems unlikely that any machine would 
> be able to survive for long without the tx entries being unmapped.
>
FYI - tried generating lots of extra tx traffic... found a way to 
generate the rx status messages on demand:
     ping -i .0000001 -s 8000 -t 2 <host> >/dev/null

Yields:
Jan 31 23:08:07 mail kernel: sky2 eth0: rx error, status 0x1f6a0010 length 1518
Jan 31 23:08:07 mail kernel: sky2 eth0: rx error, status 0x1f6a0010 length 1518
Jan 31 23:08:07 mail kernel: sky2 eth0: rx error, status 0x1f6a0010 length 1518
Jan 31 23:08:07 mail kernel: sky2 eth0: rx error, status 0x1f6a0010 length 1518
Jan 31 23:08:07 mail kernel: sky2 eth0: rx error, status 0x1f6a0010 length 1518
Jan 31 23:08:07 mail kernel: sky2 eth0: rx error, status 0x1f6a0010 length 1518
Jan 31 23:08:07 mail kernel: sky2 eth0: rx error, status 0x1f6a0010 length 1518
Jan 31 23:08:07 mail kernel: sky2 eth0: rx error, status 0x1f6a0010 length 1518
Jan 31 23:08:07 mail kernel: sky2 eth0: rx error, status 0x1f6a0010 length 1518
Jan 31 23:08:07 mail kernel: sky2 eth0: rx error, status 0x1f6a0010 length 1518
Jan 31 23:08:12 mail kernel: net_ratelimit: 316 callbacks suppressed
etc.
Looking at the packet trace, it seems that my Windows7 box is under 
*some* circumstances not observing the MTU. In this case, the ICMP reply 
is going back with the 8000 byte jumbo frame unfragmented. It seems that 
the reverse is also true. I don't know why sometimes win7 does this, and 
at other times properly fragments.

Oddly, prior to this attempt if I set no fragment on a ping from the 
windows box back to the linux box and a size of > mtu (like 8000), the 
ping failed. Absent the no-fragment flag, the ping properly fragmented. 
I am not sure why Windows now thinks the MTU is > 1500. I'll look into 
that when I have some time. It's possible that with 2.6.33-rc5 & the 
patches I've got that somehow path mtu discovery is broken as nothing 
changed on the windows side.

Understanding that the other side is out of spec, I'd still wonder why 
the sky2 driver generates rx errors. Perhaps overruns should be tossed 
silently... by the hardware if possible.
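
The numbers in the log lines above can be cross-checked with a little arithmetic. This is a sketch under an explicit assumption that the thread does not confirm: that the upper 16 bits of the status word carry the received frame length and the lower 16 bits carry flag bits. All helper names are made up for illustration. Under that reading, 0x1f6a is 8042, which is exactly an 8000-byte ICMP payload plus ICMP, IPv4 and Ethernet headers, and the reported length 1518 is the largest standard frame for a 1500-byte MTU.

```c
#include <assert.h>
#include <stdint.h>

enum {
	ETH_HDR  = 14,  /* Ethernet header, no VLAN tag */
	ETH_FCS  = 4,   /* frame check sequence */
	IPV4_HDR = 20,  /* IPv4 header without options */
	ICMP_HDR = 8    /* ICMP echo header */
};

/* ASSUMPTION: upper 16 bits of the status word hold the frame length. */
static unsigned int status_len(uint32_t status)
{
	return (unsigned int)(status >> 16);
}

/* ASSUMPTION: lower 16 bits of the status word hold flag bits. */
static unsigned int status_flags(uint32_t status)
{
	return (unsigned int)(status & 0xffffu);
}

/* Frame size (without FCS) of an unfragmented IPv4 ICMP echo. */
static unsigned int icmp_frame_len(unsigned int payload)
{
	assert(payload <= 65507u);  /* largest payload an IPv4 packet can carry */
	return ETH_HDR + IPV4_HDR + ICMP_HDR + payload;
}

/* Largest standard Ethernet frame, FCS included, for a given MTU. */
static unsigned int max_frame_for_mtu(unsigned int mtu)
{
	return mtu + ETH_HDR + ETH_FCS;
}
```

If the assumption holds, the status word is consistent with an unfragmented 8000-byte echo arriving at an interface sized for 1518-byte frames. The only low bit set is bit 4 (0x0010); what that bit means is not stated in this thread.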



Jarek Poplawski - Feb. 1, 2010, 10:47 a.m.
On Sun, Jan 31, 2010 at 11:26:03PM -0500, Michael Breuer wrote:
> FYI - tried generating lots of extra tx traffic... found a way to  
> generate the rx status messages on demand:
>     ping -i .0000001 -s 8000 -t 2 <host> >/dev/null
>
> Yields:
> Jan 31 23:08:07 mail kernel: sky2 eth0: rx error, status 0x1f6a0010 length 1518
...
> Jan 31 23:08:12 mail kernel: net_ratelimit: 316 callbacks suppressed
> etc.
...
> Understanding that the other side is out of spec, I'd still wonder why  
> the sky2 driver generates rx errors. Perhaps overruns should be tossed  
> silently... by the hardware if possible.

Of course it's a matter of taste, but it seems such errors shouldn't
be tolerated on a local network. I'd rather make them more explicit
(e.g., like some other kinds of length errors).

Jarek P.
Stephen Hemminger - Feb. 1, 2010, 6:08 p.m.
On Sun, 31 Jan 2010 23:18:35 +0100
Jarek Poplawski <jarkao2@gmail.com> wrote:

> @@ -1025,9 +1025,10 @@ static void sky2_prefetch_init(struct sky2_hw *hw, u32 qaddr,
>  static inline struct sky2_tx_le *get_tx_le(struct sky2_port *sky2, u16 *slot)
>  {
>  	struct sky2_tx_le *le = sky2->tx_le + *slot;
> -	struct tx_ring_info *re = sky2->tx_ring + *slot;
> +	struct tx_ring_info *re;
>  
>  	*slot = RING_NEXT(*slot, sky2->tx_ring_size);
> +	re = sky2->tx_ring + *slot;
>  	re->flags = 0;

Bogus: le and re are 1-to-1, so the hardware portion and the software
portion should be at the same index.
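
To illustrate the 1-to-1 point, here is a user-space sketch of two parallel rings indexed by the same slot. It rests on assumptions not shown in this thread: the wrap rule assumes a power-of-two ring size, the structs are heavily simplified, and the two callers are hypothetical. It is not a claim about what the driver's transmit path does. It shows that with a same-index helper, a caller that tags the software entry before fetching the element loses the tag, while a caller that tags after fetching keeps it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define RING_SIZE 8u                               /* must be a power of two */
#define RING_NEXT(x, s) (((x) + 1u) & ((s) - 1u))  /* assumed wrap rule */

struct tx_le   { uint32_t addr; uint8_t ctrl; };    /* hardware side, simplified */
struct tx_info { unsigned int flags; void *skb; };  /* software side, simplified */

struct port {
	struct tx_le   le[RING_SIZE];
	struct tx_info re[RING_SIZE];
};

/* Clear le and re at the SAME index, then advance the slot. */
static struct tx_le *get_le_same_index(struct port *p, uint16_t *slot)
{
	assert(*slot < RING_SIZE);

	struct tx_le *le = &p->le[*slot];
	struct tx_info *re = &p->re[*slot];

	*slot = (uint16_t)RING_NEXT(*slot, RING_SIZE);
	re->flags = 0;
	re->skb = NULL;
	le->ctrl = 0;
	return le;
}

/* Hypothetical caller A: tag the entry, then fetch. Returns surviving flags. */
static unsigned int tag_then_get(uint16_t slot)
{
	struct port p;
	uint16_t s = slot;

	memset(&p, 0, sizeof(p));
	p.re[slot].flags = 1;             /* record how the buffer was mapped */
	(void)get_le_same_index(&p, &s);  /* wipes the flags just written */
	return p.re[slot].flags;
}

/* Hypothetical caller B: fetch, then tag the same index. */
static unsigned int get_then_tag(uint16_t slot)
{
	struct port p;
	uint16_t s = slot;

	memset(&p, 0, sizeof(p));
	(void)get_le_same_index(&p, &s);
	p.re[slot].flags = 1;             /* written after the clear, so it stays */
	return p.re[slot].flags;
}
```

In this model the indices stay paired either way; only the order of tagging and clearing decides whether the flags survive until unmap time.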

Patch

diff --git a/drivers/net/sky2.c b/drivers/net/sky2.c
index d760650..3437917 100644
--- a/drivers/net/sky2.c
+++ b/drivers/net/sky2.c
@@ -1025,9 +1025,10 @@  static void sky2_prefetch_init(struct sky2_hw *hw, u32 qaddr,
 static inline struct sky2_tx_le *get_tx_le(struct sky2_port *sky2, u16 *slot)
 {
 	struct sky2_tx_le *le = sky2->tx_le + *slot;
-	struct tx_ring_info *re = sky2->tx_ring + *slot;
+	struct tx_ring_info *re;
 
 	*slot = RING_NEXT(*slot, sky2->tx_ring_size);
+	re = sky2->tx_ring + *slot;
 	re->flags = 0;
 	re->skb = NULL;
 	le->ctrl = 0;
@@ -1036,13 +1037,16 @@  static inline struct sky2_tx_le *get_tx_le(struct sky2_port *sky2, u16 *slot)
 
 static void tx_init(struct sky2_port *sky2)
 {
-	struct sky2_tx_le *le;
+	struct sky2_tx_le *le = sky2->tx_le;
+	struct tx_ring_info *re = sky2->tx_ring;
 
 	sky2->tx_prod = sky2->tx_cons = 0;
 	sky2->tx_tcpsum = 0;
 	sky2->tx_last_mss = 0;
 
-	le = get_tx_le(sky2, &sky2->tx_prod);
+	re->flags = 0;
+	re->skb = NULL;
+	le->ctrl = 0;
 	le->addr = 0;
 	le->opcode = OP_ADDR64 | HW_OWNER;
 	sky2->tx_last_upper = 0;