diff mbox

[net-next-2.6] net: reinject arps into bonding slave instead of master

Message ID 20110308071350.GA2826@psychotron.redhat.com
State RFC, archived
Delegated to: David Miller
Headers show

Commit Message

Jiri Pirko March 8, 2011, 7:13 a.m. UTC
Mon, Mar 07, 2011 at 11:43:38PM CET, andy@greyhouse.net wrote:
>On Mon, Mar 07, 2011 at 01:51:00PM +0100, Jiri Pirko wrote:
>> Recent patch "bonding: move processing of recv handlers into
>> handle_frame()" caused a regression on following net scheme:
>> 
>> eth0 - bond0 - bond0.5
>> 
>> where arp monitoring is happening over vlan. This patch fixes it by
>> reinjecting the arp packet into bonding slave device so the bonding
>> rx_handler can pickup and process it.
>> 
>> Signed-off-by: Jiri Pirko <jpirko@redhat.com>
>> ---
>>  net/core/dev.c |    8 ++++----
>>  1 files changed, 4 insertions(+), 4 deletions(-)
>> 
>> diff --git a/net/core/dev.c b/net/core/dev.c
>> index c71bd18..3d88458 100644
>> --- a/net/core/dev.c
>> +++ b/net/core/dev.c
>> @@ -3094,12 +3094,12 @@ void netdev_rx_handler_unregister(struct net_device *dev)
>>  }
>>  EXPORT_SYMBOL_GPL(netdev_rx_handler_unregister);
>>  
>> -static void vlan_on_bond_hook(struct sk_buff *skb)
>> +static void vlan_on_bond_hook(struct sk_buff *skb, struct net_device *orig_dev)
>>  {
>>  	/*
>>  	 * Make sure ARP frames received on VLAN interfaces stacked on
>>  	 * bonding interfaces still make their way to any base bonding
>> -	 * device that may have registered for a specific ptype.
>> +	 * device by reinjecting the frame into bonding slave (orig_dev)
>>  	 */
>>  	if (skb->dev->priv_flags & IFF_802_1Q_VLAN &&
>>  	    vlan_dev_real_dev(skb->dev)->priv_flags & IFF_BONDING &&
>> @@ -3108,7 +3108,7 @@ static void vlan_on_bond_hook(struct sk_buff *skb)
>>  
>>  		if (!skb2)
>>  			return;
>> -		skb2->dev = vlan_dev_real_dev(skb->dev);
>> +		skb2->dev = orig_dev;
>>  		netif_rx(skb2);
>>  	}
>>  }
>> @@ -3202,7 +3202,7 @@ ncls:
>>  			goto out;
>>  	}
>>  
>> -	vlan_on_bond_hook(skb);
>> +	vlan_on_bond_hook(skb, orig_dev);
>>  
>>  	/* deliver only exact match when indicated */
>>  	null_or_dev = deliver_exact ? skb->dev : NULL;
>
>This patch doesn't work.
>
>My setup has bond0.100 -> bond0 -> eth2 and eth3.  ARP monitoring is
>enabled as is arp_valiate.
>
>The initial problem was just that just before vlan_on_bond_hook is
>called, skb->dev = bond0.100 and orig_dev = eth2.   (This is after
>running goto another_route and having been called back through
>__netif_receive_skb since vlan_hwaccel_do_receive it true.)
>
>Now vlan_on_bond_hook is called and we have 2 skbs.
>
>The original skb still have skb->dev = bond0.100 and orig_dev = eth2.
>Since bond_arp_rcv is registered for traffic only to bond0, the handler
>is not hit and the frame is dropped (or processed by another handler).
>
>The cloned skb has skb->dev = bond0 and is put back on the receive queue
>and comes back through __netif_receive_skb.  This frame will match the
>ptype entry for bond_arp_rcv, but since orig_dev = bond0 in this case,
>the code in bond_arp_rcv will not handle the frame.  
>
>If we truly want to track the original interface that received the
>frame, the following is a better option.  With the recursive nature of
>__netif_receive_skb at this point, we should really consider setting
>orig_dev from skb_iif rather than just from skb->dev.
>
>diff --git a/net/core/dev.c b/net/core/dev.c
>index 30440e7..500fdbc 100644
>--- a/net/core/dev.c
>+++ b/net/core/dev.c
>@@ -3135,7 +3135,6 @@ static int __netif_receive_skb(struct sk_buff *skb)
> 
> 	if (!skb->skb_iif)
> 		skb->skb_iif = skb->dev->ifindex;
>-	orig_dev = skb->dev;
> 
> 	skb_reset_network_header(skb);
> 	skb_reset_transport_header(skb);
>@@ -3145,6 +3144,7 @@ static int __netif_receive_skb(struct sk_buff *skb)
> 
> 	rcu_read_lock();
> 
>+	orig_dev = dev_get_by_index_rcu(dev_net(skb->dev),skb->skb_iif);
> another_round:
> 
> 	__this_cpu_inc(softnet_data.processed);
>

This was proposed earlier. people did not like this very much :(

I forgot to include crucial part of "goto another_round for vlan".
Following patch should work (will test it once I get to work):

Subject: [patch net-next 2.6] net: reinject arps into bonding slave instead of master

Recent patch "bonding: move processing of recv handlers into
handle_frame()" caused a regression on following net scheme:

eth0 - bond0 - bond0.5

where arp monitoring is happening over vlan. This patch fixes it by
reinjecting the arp packet into bonding slave device so the bonding
rx_handler can pickup and process it.

also instead of calling __netif_receive_skb recursively, "goto another
round" does this recursion. The point is the orig_dev variable remains
intact.

Signed-off-by: Jiri Pirko <jpirko@redhat.com>
---
 net/core/dev.c |   11 +++++------
 1 files changed, 5 insertions(+), 6 deletions(-)

Comments

Andy Gospodarek March 8, 2011, 1:42 p.m. UTC | #1
On Tue, Mar 08, 2011 at 08:13:51AM +0100, Jiri Pirko wrote:
> Mon, Mar 07, 2011 at 11:43:38PM CET, andy@greyhouse.net wrote:
[...]
> >
> >This patch doesn't work.
> >
> >My setup has bond0.100 -> bond0 -> eth2 and eth3.  ARP monitoring is
> >enabled as is arp_valiate.
> >
> >The initial problem was just that just before vlan_on_bond_hook is
> >called, skb->dev = bond0.100 and orig_dev = eth2.   (This is after
> >running goto another_route and having been called back through
> >__netif_receive_skb since vlan_hwaccel_do_receive it true.)
> >
> >Now vlan_on_bond_hook is called and we have 2 skbs.
> >
> >The original skb still have skb->dev = bond0.100 and orig_dev = eth2.
> >Since bond_arp_rcv is registered for traffic only to bond0, the handler
> >is not hit and the frame is dropped (or processed by another handler).
> >
> >The cloned skb has skb->dev = bond0 and is put back on the receive queue
> >and comes back through __netif_receive_skb.  This frame will match the
> >ptype entry for bond_arp_rcv, but since orig_dev = bond0 in this case,
> >the code in bond_arp_rcv will not handle the frame.  
> >
> >If we truly want to track the original interface that received the
> >frame, the following is a better option.  With the recursive nature of
> >__netif_receive_skb at this point, we should really consider setting
> >orig_dev from skb_iif rather than just from skb->dev.
> >
> >diff --git a/net/core/dev.c b/net/core/dev.c
> >index 30440e7..500fdbc 100644
> >--- a/net/core/dev.c
> >+++ b/net/core/dev.c
> >@@ -3135,7 +3135,6 @@ static int __netif_receive_skb(struct sk_buff *skb)
> > 
> > 	if (!skb->skb_iif)
> > 		skb->skb_iif = skb->dev->ifindex;
> >-	orig_dev = skb->dev;
> > 
> > 	skb_reset_network_header(skb);
> > 	skb_reset_transport_header(skb);
> >@@ -3145,6 +3144,7 @@ static int __netif_receive_skb(struct sk_buff *skb)
> > 
> > 	rcu_read_lock();
> > 
> >+	orig_dev = dev_get_by_index_rcu(dev_net(skb->dev),skb->skb_iif);
> > another_round:
> > 
> > 	__this_cpu_inc(softnet_data.processed);
> >
> 
> This was proposed earlier. people did not like this very much :(
> 

Interesting.  I'll have to look back at the discussion.

> I forgot to include crucial part of "goto another_round for vlan".
> Following patch should work (will test it once I get to work):
> 
> Subject: [patch net-next 2.6] net: reinject arps into bonding slave instead of master
> 
> Recent patch "bonding: move processing of recv handlers into
> handle_frame()" caused a regression on following net scheme:
> 
> eth0 - bond0 - bond0.5
> 
> where arp monitoring is happening over vlan. This patch fixes it by
> reinjecting the arp packet into bonding slave device so the bonding
> rx_handler can pickup and process it.
> 
> also instead of calling __netif_receive_skb recursively, "goto another
> round" does this recursion. The point is the orig_dev variable remains
> intact.
> 
> Signed-off-by: Jiri Pirko <jpirko@redhat.com>
> ---
>  net/core/dev.c |   11 +++++------
>  1 files changed, 5 insertions(+), 6 deletions(-)
> 
> diff --git a/net/core/dev.c b/net/core/dev.c
> index c71bd18..ec330e1 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -3094,12 +3094,12 @@ void netdev_rx_handler_unregister(struct net_device *dev)
>  }
>  EXPORT_SYMBOL_GPL(netdev_rx_handler_unregister);
>  
> -static void vlan_on_bond_hook(struct sk_buff *skb)
> +static void vlan_on_bond_hook(struct sk_buff *skb, struct net_device *orig_dev)
>  {
>  	/*
>  	 * Make sure ARP frames received on VLAN interfaces stacked on
>  	 * bonding interfaces still make their way to any base bonding
> -	 * device that may have registered for a specific ptype.
> +	 * device by reinjecting the frame into bonding slave (orig_dev)
>  	 */
>  	if (skb->dev->priv_flags & IFF_802_1Q_VLAN &&
>  	    vlan_dev_real_dev(skb->dev)->priv_flags & IFF_BONDING &&
> @@ -3108,7 +3108,7 @@ static void vlan_on_bond_hook(struct sk_buff *skb)
>  
>  		if (!skb2)
>  			return;
> -		skb2->dev = vlan_dev_real_dev(skb->dev);
> +		skb2->dev = orig_dev;
>  		netif_rx(skb2);
>  	}
> 

I'm pretty sure this patch will have the same catastrophic problem your
last one did.  By cloning and setting skb2->dev = orig_dev you just
inserted a frame identical to the one we received right back into the
stack.  It only took a few minutes for my box to melt as one frame on
the wire will cause an infinite number of frames to be received by the
stack.

> @@ -3196,13 +3196,12 @@ ncls:
>  			pt_prev = NULL;
>  		}
>  		if (vlan_hwaccel_do_receive(&skb)) {
> -			ret = __netif_receive_skb(skb);
> -			goto out;
> +			goto another_round;
>  		} else if (unlikely(!skb))
>  			goto out;
>  	}
>  
> -	vlan_on_bond_hook(skb);
> +	vlan_on_bond_hook(skb, orig_dev);
>  
>  	/* deliver only exact match when indicated */
>  	null_or_dev = deliver_exact ? skb->dev : NULL;
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Nicolas de Pesloüan March 8, 2011, 9:44 p.m. UTC | #2
Le 08/03/2011 14:42, Andy Gospodarek a écrit :
> I'm pretty sure this patch will have the same catastrophic problem your
> last one did.  By cloning and setting skb2->dev = orig_dev you just
> inserted a frame identical to the one we received right back into the
> stack.  It only took a few minutes for my box to melt as one frame on
> the wire will cause an infinite number of frames to be received by the
> stack.

I agree with Andy. We still keep one reinject (netif_rx), which is probably better that two 
(__netif_receive_skb), but not enough.

I really think we need a general framework for late delivery of final packets to packet handler 
registered somewhere in the rx_handler path.

Jiri, is this patch the one you announced as "I have some kind nice solution in mind and I'm going 
to submit that as a patch later (too many patches are in the wind atm)" ?
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jiri Pirko March 9, 2011, 7:45 a.m. UTC | #3
Tue, Mar 08, 2011 at 10:44:37PM CET, nicolas.2p.debian@gmail.com wrote:
>Le 08/03/2011 14:42, Andy Gospodarek a écrit :
>>I'm pretty sure this patch will have the same catastrophic problem your
>>last one did.  By cloning and setting skb2->dev = orig_dev you just
>>inserted a frame identical to the one we received right back into the
>>stack.  It only took a few minutes for my box to melt as one frame on
>>the wire will cause an infinite number of frames to be received by the
>>stack.
>
>I agree with Andy. We still keep one reinject (netif_rx), which is
>probably better that two (__netif_receive_skb), but not enough.
>
>I really think we need a general framework for late delivery of final
>packets to packet handler registered somewhere in the rx_handler
>path.
>
>Jiri, is this patch the one you announced as "I have some kind nice
>solution in mind and I'm going to submit that as a patch later (too
>many patches are in the wind atm)" ?


I did not had time to verify my thought yet but I think that the only
think needed against my original patch (bonding: move processing of recv
handlers into handle_frame()) is ro remove vlan_on_bond_hook, period.

Because all incoming arps are seen by bond_handle_frame =>
bond->recv_probe , even vlan ones - that would make eth0-bond0-bond0.5
work and eth0-bond0-br0-bond0.5 as well. But again, need to verify this.

Jirka
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Neil Horman March 9, 2011, 1:33 p.m. UTC | #4
On Tue, Mar 08, 2011 at 10:44:37PM +0100, Nicolas de Pesloüan wrote:
> Le 08/03/2011 14:42, Andy Gospodarek a écrit :
> >I'm pretty sure this patch will have the same catastrophic problem your
> >last one did.  By cloning and setting skb2->dev = orig_dev you just
> >inserted a frame identical to the one we received right back into the
> >stack.  It only took a few minutes for my box to melt as one frame on
> >the wire will cause an infinite number of frames to be received by the
> >stack.
> 
> I agree with Andy. We still keep one reinject (netif_rx), which is
> probably better that two (__netif_receive_skb), but not enough.
> 
> I really think we need a general framework for late delivery of
> final packets to packet handler registered somewhere in the
> rx_handler path.
> 
> Jiri, is this patch the one you announced as "I have some kind nice
> solution in mind and I'm going to submit that as a patch later (too
> many patches are in the wind atm)" ?


I've not had time to flesh any of this out, but have you considered the use of
privately allocated net namespaces to define the receive order of frames when
doing multiple re-injections through netif_rx/netif_receive_skb.  If you
modified the ptype list traversals so validate that dev_net(ptype->dev) ==
dev_net(skb->dev) before delivery, you could allocate private net namespaces in
your virutal master drivers (bridges/bonds/vlans) and place the slave devices in
those spaces, so that only the parent device could receive frames on them.

Of course thats going to have its own set of problems to deal with, but it might
be worth considering.
Neil

> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Nicolas de Pesloüan March 9, 2011, 2:49 p.m. UTC | #5
Le 09/03/2011 08:45, Jiri Pirko a écrit :
> Tue, Mar 08, 2011 at 10:44:37PM CET, nicolas.2p.debian@gmail.com wrote:
>> Le 08/03/2011 14:42, Andy Gospodarek a écrit :
>>> I'm pretty sure this patch will have the same catastrophic problem your
>>> last one did.  By cloning and setting skb2->dev = orig_dev you just
>>> inserted a frame identical to the one we received right back into the
>>> stack.  It only took a few minutes for my box to melt as one frame on
>>> the wire will cause an infinite number of frames to be received by the
>>> stack.
>>
>> I agree with Andy. We still keep one reinject (netif_rx), which is
>> probably better that two (__netif_receive_skb), but not enough.
>>
>> I really think we need a general framework for late delivery of final
>> packets to packet handler registered somewhere in the rx_handler
>> path.
>>
>> Jiri, is this patch the one you announced as "I have some kind nice
>> solution in mind and I'm going to submit that as a patch later (too
>> many patches are in the wind atm)" ?
>
>
> I did not had time to verify my thought yet but I think that the only
> think needed against my original patch (bonding: move processing of recv
> handlers into handle_frame()) is ro remove vlan_on_bond_hook, period.
>
> Because all incoming arps are seen by bond_handle_frame =>
> bond->recv_probe , even vlan ones - that would make eth0-bond0-bond0.5
> work and eth0-bond0-br0-bond0.5 as well. But again, need to verify this.

All incoming ARPs are seen by bond_handle_frame, even vlan ones, definitely true.

But as some of them *are* vlan ones, you will have to untag those "by hand" inside bond_handle_frame.

It might work, but I wouldn't support the idea, for two reasons :

- It would duplicate code, or at least, duplicate places where untagging happens.
- It would cause some vlan hacks to sit inside bond, which is not nice from a layering point of 
view. You recently advocated against breaking layering and you are right. We should avoid putting X 
related stuff into Y part of the network stack, as much as possible.

Again, I think we should try and find a generic way to have the final frame delivered to whoever 
needs it. And this way should not be limited to "vlan untagging" nor to bonding.

	Nicolas.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jiri Pirko March 9, 2011, 3:09 p.m. UTC | #6
Wed, Mar 09, 2011 at 03:49:49PM CET, nicolas.2p.debian@gmail.com wrote:
>Le 09/03/2011 08:45, Jiri Pirko a écrit :
>>Tue, Mar 08, 2011 at 10:44:37PM CET, nicolas.2p.debian@gmail.com wrote:
>>>Le 08/03/2011 14:42, Andy Gospodarek a écrit :
>>>>I'm pretty sure this patch will have the same catastrophic problem your
>>>>last one did.  By cloning and setting skb2->dev = orig_dev you just
>>>>inserted a frame identical to the one we received right back into the
>>>>stack.  It only took a few minutes for my box to melt as one frame on
>>>>the wire will cause an infinite number of frames to be received by the
>>>>stack.
>>>
>>>I agree with Andy. We still keep one reinject (netif_rx), which is
>>>probably better that two (__netif_receive_skb), but not enough.
>>>
>>>I really think we need a general framework for late delivery of final
>>>packets to packet handler registered somewhere in the rx_handler
>>>path.
>>>
>>>Jiri, is this patch the one you announced as "I have some kind nice
>>>solution in mind and I'm going to submit that as a patch later (too
>>>many patches are in the wind atm)" ?
>>
>>
>>I did not had time to verify my thought yet but I think that the only
>>think needed against my original patch (bonding: move processing of recv
>>handlers into handle_frame()) is ro remove vlan_on_bond_hook, period.
>>
>>Because all incoming arps are seen by bond_handle_frame =>
>>bond->recv_probe , even vlan ones - that would make eth0-bond0-bond0.5
>>work and eth0-bond0-br0-bond0.5 as well. But again, need to verify this.
>
>All incoming ARPs are seen by bond_handle_frame, even vlan ones, definitely true.
>
>But as some of them *are* vlan ones, you will have to untag those "by hand" inside bond_handle_frame.

Hmm. For hw vlan accel this would not be needed. But for
non-hw-vlan-accel the frame is wrapped when going thru handle_frame.
And yes, in that case untagging would be necessary. Unless the vlan
untagging happening now in ptype_base handler is moved in rx path
somewhere before __netif_receive_skb. That would result in two things:

1) Bond would be able to scope vlan packets
2) The rx path for hw-vlan-accel and non-hw-vlan-accel would be the same
   (should be desirable + one reinjection would be avoided for
    non-hw-vlan-accel)
>
>It might work, but I wouldn't support the idea, for two reasons :
>
>- It would duplicate code, or at least, duplicate places where untagging happens.
>- It would cause some vlan hacks to sit inside bond, which is not
>nice from a layering point of view. You recently advocated against
>breaking layering and you are right. We should avoid putting X
>related stuff into Y part of the network stack, as much as possible.
>
>Again, I think we should try and find a generic way to have the final
>frame delivered to whoever needs it. And this way should not be
>limited to "vlan untagging" nor to bonding.
>
>	Nicolas.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Nicolas de Pesloüan March 9, 2011, 3:28 p.m. UTC | #7
Le 09/03/2011 16:09, Jiri Pirko a écrit :
> Wed, Mar 09, 2011 at 03:49:49PM CET, nicolas.2p.debian@gmail.com wrote:
>> Le 09/03/2011 08:45, Jiri Pirko a écrit :
>>> Tue, Mar 08, 2011 at 10:44:37PM CET, nicolas.2p.debian@gmail.com wrote:
>>>> Le 08/03/2011 14:42, Andy Gospodarek a écrit :
>>>>> I'm pretty sure this patch will have the same catastrophic problem your
>>>>> last one did.  By cloning and setting skb2->dev = orig_dev you just
>>>>> inserted a frame identical to the one we received right back into the
>>>>> stack.  It only took a few minutes for my box to melt as one frame on
>>>>> the wire will cause an infinite number of frames to be received by the
>>>>> stack.
>>>>
>>>> I agree with Andy. We still keep one reinject (netif_rx), which is
>>>> probably better that two (__netif_receive_skb), but not enough.
>>>>
>>>> I really think we need a general framework for late delivery of final
>>>> packets to packet handler registered somewhere in the rx_handler
>>>> path.
>>>>
>>>> Jiri, is this patch the one you announced as "I have some kind nice
>>>> solution in mind and I'm going to submit that as a patch later (too
>>>> many patches are in the wind atm)" ?
>>>
>>>
>>> I did not had time to verify my thought yet but I think that the only
>>> think needed against my original patch (bonding: move processing of recv
>>> handlers into handle_frame()) is ro remove vlan_on_bond_hook, period.
>>>
>>> Because all incoming arps are seen by bond_handle_frame =>
>>> bond->recv_probe , even vlan ones - that would make eth0-bond0-bond0.5
>>> work and eth0-bond0-br0-bond0.5 as well. But again, need to verify this.
>>
>> All incoming ARPs are seen by bond_handle_frame, even vlan ones, definitely true.
>>
>> But as some of them *are* vlan ones, you will have to untag those "by hand" inside bond_handle_frame.
>

> Hmm. For hw vlan accel this would not be needed.

Agreed.

> But for non-hw-vlan-accel the frame is wrapped when going thru handle_frame. And yes, in that
> case untagging would be necessary. Unless the vlan untagging happening now in ptype_base handler
> is moved in rx path somewhere before __netif_receive_skb.

Can't it me moved not before, but inside __netif_receive_skb, as a rx_handler?

At the time we setup eth0.100, we can arrange for a vlan_untag rx_handler to be registered on eth0. 
This is exactly the way it works now for macvlan. Should it be possible (and desirable) for vlan too?

This might re-open the discussion about the need for several rx_handlers per interface, but that is 
another story.

Also, moving it before __netif_receive_skb would cause every protocol handlers to receive the frame 
untagged. At least protocol sniffers registered at ptype_all may want to receive the tagged frame, 
for diagnostic purpose.

That would result in two things:
>
> 1) Bond would be able to scope vlan packets
> 2) The rx path for hw-vlan-accel and non-hw-vlan-accel would be the same
>     (should be desirable + one reinjection would be avoided for
>      non-hw-vlan-accel)

Agreed, but moving it inside __netif_receive_skb should have the same effects. If we move 
non-hw-accel-vlan to a rx_handler, the skb would be COW before being untagged only if shared. This 
sounds good to me.

	Nicolas.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jiri Pirko March 9, 2011, 5:11 p.m. UTC | #8
Wed, Mar 09, 2011 at 04:28:34PM CET, nicolas.2p.debian@gmail.com wrote:
>Le 09/03/2011 16:09, Jiri Pirko a écrit :
>>Wed, Mar 09, 2011 at 03:49:49PM CET, nicolas.2p.debian@gmail.com wrote:
>>>Le 09/03/2011 08:45, Jiri Pirko a écrit :
>>>>Tue, Mar 08, 2011 at 10:44:37PM CET, nicolas.2p.debian@gmail.com wrote:
>>>>>Le 08/03/2011 14:42, Andy Gospodarek a écrit :
>>>>>>I'm pretty sure this patch will have the same catastrophic problem your
>>>>>>last one did.  By cloning and setting skb2->dev = orig_dev you just
>>>>>>inserted a frame identical to the one we received right back into the
>>>>>>stack.  It only took a few minutes for my box to melt as one frame on
>>>>>>the wire will cause an infinite number of frames to be received by the
>>>>>>stack.
>>>>>
>>>>>I agree with Andy. We still keep one reinject (netif_rx), which is
>>>>>probably better that two (__netif_receive_skb), but not enough.
>>>>>
>>>>>I really think we need a general framework for late delivery of final
>>>>>packets to packet handler registered somewhere in the rx_handler
>>>>>path.
>>>>>
>>>>>Jiri, is this patch the one you announced as "I have some kind nice
>>>>>solution in mind and I'm going to submit that as a patch later (too
>>>>>many patches are in the wind atm)" ?
>>>>
>>>>
>>>>I did not had time to verify my thought yet but I think that the only
>>>>think needed against my original patch (bonding: move processing of recv
>>>>handlers into handle_frame()) is ro remove vlan_on_bond_hook, period.
>>>>
>>>>Because all incoming arps are seen by bond_handle_frame =>
>>>>bond->recv_probe , even vlan ones - that would make eth0-bond0-bond0.5
>>>>work and eth0-bond0-br0-bond0.5 as well. But again, need to verify this.
>>>
>>>All incoming ARPs are seen by bond_handle_frame, even vlan ones, definitely true.
>>>
>>>But as some of them *are* vlan ones, you will have to untag those "by hand" inside bond_handle_frame.
>>
>
>>Hmm. For hw vlan accel this would not be needed.
>
>Agreed.
>
>>But for non-hw-vlan-accel the frame is wrapped when going thru handle_frame. And yes, in that
>>case untagging would be necessary. Unless the vlan untagging happening now in ptype_base handler
>>is moved in rx path somewhere before __netif_receive_skb.
>
>Can't it me moved not before, but inside __netif_receive_skb, as a rx_handler?

I don't think so - rx_handler is not right place to untag. rx_handler is
per device. untaging should happen idealy on realdev skb injection. That
way the rx path for hw-vlan-accel and non-he-vlan-accel would be very
similar.

>
>At the time we setup eth0.100, we can arrange for a vlan_untag
>rx_handler to be registered on eth0. This is exactly the way it works
>now for macvlan. Should it be possible (and desirable) for vlan too?
>
>This might re-open the discussion about the need for several
>rx_handlers per interface, but that is another story.
>
>Also, moving it before __netif_receive_skb would cause every protocol
>handlers to receive the frame untagged. At least protocol sniffers
>registered at ptype_all may want to receive the tagged frame, for
>diagnostic purpose.

Thats how its done for hw-vlan-accel. No problem here.

>
>That would result in two things:
>>
>>1) Bond would be able to scope vlan packets
>>2) The rx path for hw-vlan-accel and non-hw-vlan-accel would be the same
>>    (should be desirable + one reinjection would be avoided for
>>     non-hw-vlan-accel)
>
>Agreed, but moving it inside __netif_receive_skb should have the same
>effects. If we move non-hw-accel-vlan to a rx_handler, the skb would
>be COW before being untagged only if shared. This sounds good to me.
>
>	Nicolas.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jiri Pirko March 9, 2011, 8:51 p.m. UTC | #9
Wed, Mar 09, 2011 at 08:45:48AM CET, jpirko@redhat.com wrote:
>Tue, Mar 08, 2011 at 10:44:37PM CET, nicolas.2p.debian@gmail.com wrote:
>>Le 08/03/2011 14:42, Andy Gospodarek a écrit :
>>>I'm pretty sure this patch will have the same catastrophic problem your
>>>last one did.  By cloning and setting skb2->dev = orig_dev you just
>>>inserted a frame identical to the one we received right back into the
>>>stack.  It only took a few minutes for my box to melt as one frame on
>>>the wire will cause an infinite number of frames to be received by the
>>>stack.
>>
>>I agree with Andy. We still keep one reinject (netif_rx), which is
>>probably better that two (__netif_receive_skb), but not enough.
>>
>>I really think we need a general framework for late delivery of final
>>packets to packet handler registered somewhere in the rx_handler
>>path.
>>
>>Jiri, is this patch the one you announced as "I have some kind nice
>>solution in mind and I'm going to submit that as a patch later (too
>>many patches are in the wind atm)" ?
>
>
>I did not had time to verify my thought yet but I think that the only
>think needed against my original patch (bonding: move processing of recv
>handlers into handle_frame()) is ro remove vlan_on_bond_hook, period.
>
>Because all incoming arps are seen by bond_handle_frame =>
>bond->recv_probe , even vlan ones - that would make eth0-bond0-bond0.5
>work and eth0-bond0-br0-bond0.5 as well. But again, need to verify this.

Working good even for eth0-bond0-br0-bond0.5 mode1 arp_validate,
hwvlanaccel. Looking good so far.

>
>Jirka
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Nicolas de Pesloüan March 9, 2011, 10:18 p.m. UTC | #10
Le 09/03/2011 18:11, Jiri Pirko a écrit :
> Wed, Mar 09, 2011 at 04:28:34PM CET, nicolas.2p.debian@gmail.com wrote:
>> Le 09/03/2011 16:09, Jiri Pirko a écrit :
>>> Wed, Mar 09, 2011 at 03:49:49PM CET, nicolas.2p.debian@gmail.com wrote:
>>>> Le 09/03/2011 08:45, Jiri Pirko a écrit :
>>>>> Tue, Mar 08, 2011 at 10:44:37PM CET, nicolas.2p.debian@gmail.com wrote:
>>>>>> Le 08/03/2011 14:42, Andy Gospodarek a écrit :
>>>>>>> I'm pretty sure this patch will have the same catastrophic problem your
>>>>>>> last one did.  By cloning and setting skb2->dev = orig_dev you just
>>>>>>> inserted a frame identical to the one we received right back into the
>>>>>>> stack.  It only took a few minutes for my box to melt as one frame on
>>>>>>> the wire will cause an infinite number of frames to be received by the
>>>>>>> stack.
>>>>>>
>>>>>> I agree with Andy. We still keep one reinject (netif_rx), which is
>>>>>> probably better that two (__netif_receive_skb), but not enough.
>>>>>>
>>>>>> I really think we need a general framework for late delivery of final
>>>>>> packets to packet handler registered somewhere in the rx_handler
>>>>>> path.
>>>>>>
>>>>>> Jiri, is this patch the one you announced as "I have some kind nice
>>>>>> solution in mind and I'm going to submit that as a patch later (too
>>>>>> many patches are in the wind atm)" ?
>>>>>
>>>>>
>>>>> I did not had time to verify my thought yet but I think that the only
>>>>> think needed against my original patch (bonding: move processing of recv
>>>>> handlers into handle_frame()) is ro remove vlan_on_bond_hook, period.
>>>>>
>>>>> Because all incoming arps are seen by bond_handle_frame =>
>>>>> bond->recv_probe , even vlan ones - that would make eth0-bond0-bond0.5
>>>>> work and eth0-bond0-br0-bond0.5 as well. But again, need to verify this.
>>>>
>>>> All incoming ARPs are seen by bond_handle_frame, even vlan ones, definitely true.
>>>>
>>>> But as some of them *are* vlan ones, you will have to untag those "by hand" inside bond_handle_frame.
>>>
>>
>>> Hmm. For hw vlan accel this would not be needed.
>>
>> Agreed.
>>
>>> But for non-hw-vlan-accel the frame is wrapped when going thru handle_frame. And yes, in that
>>> case untagging would be necessary. Unless the vlan untagging happening now in ptype_base handler
>>> is moved in rx path somewhere before __netif_receive_skb.
>>
>> Can't it me moved not before, but inside __netif_receive_skb, as a rx_handler?
>
> I don't think so - rx_handler is not right place to untag. rx_handler is
> per device. untaging should happen idealy on realdev skb injection. That
> way the rx path for hw-vlan-accel and non-he-vlan-accel would be very
> similar.

I don't understand your argument about "rx_handler is per device". A vlan interface is by design 
built on top a parent (physical or logical) interface. Why can't we use the rx_handler of the parent 
device to untag?

Untagging the frame very early (before __netif_receive_skb() would work if the value of skb->dev, at 
the time the frame enter __netif_receive_skb, is the parent interface for the vlan interface: eth0 
-> eth0.100. This is exactly what happens for hw-accel vlan frame and it sounds logical at first to 
do it the same way for non-hw-accel device.

But for all others setups, where there exist some net_devices before the "untagging" one, you would 
face some troubles. For example, with eth0+eth1 -> br0 -> br0.100, you cannot untag before entering 
__netif_receive_skb. If you do so, the bridge would receive untagged frame and if the frame is not 
for the local host, the bridge would forward an untagged frame while it is expected to forward a 
tagged one. Even if the bridge is in a position to know the frame *was* tagged, we cannot expect the 
bridge to do special processing to handle this situation. Doing so would break layering.

On another setup, eth0 -> eth0.100 -> br0 -> eth1.200 -> eth1, on the opposite, we expect the bridge 
to receive and to forward an untagged frame, that has been untagged while crossing eth0.100 and will 
be tagged while crossing eth1.200.

If we want to avoid reinjection as much as possible (and I advocate for that, you know :-)), then 
the only place to untag is in the rx_header for the parent device of the vlan device. I still don't 
understand the problem you see with this proposal.

And the only remaining discrepancy would be in hw-accel vs non-hw-accel. A protocol handler 
registered on eth0 -> eth0.100 would receive:
- a tagged frame for non-hw-accel.
- an untagged frame for hw-accel.

I think it is already true today, so I don't consider this a problem. We might arrange to fix this 
discrepancy later, if someone consider this a problem.

	Nicolas.

>> At the time we setup eth0.100, we can arrange for a vlan_untag
>> rx_handler to be registered on eth0. This is exactly the way it works
>> now for macvlan. Should it be possible (and desirable) for vlan too?
>>
>> This might re-open the discussion about the need for several
>> rx_handlers per interface, but that is another story.
>>
>> Also, moving it before __netif_receive_skb would cause every protocol
>> handlers to receive the frame untagged. At least protocol sniffers
>> registered at ptype_all may want to receive the tagged frame, for
>> diagnostic purpose.
>
> Thats how its done for hw-vlan-accel. No problem here.
>
>>
>> That would result in two things:
>>>
>>> 1) Bond would be able to scope vlan packets
>>> 2) The rx path for hw-vlan-accel and non-hw-vlan-accel would be the same
>>>     (should be desirable + one reinjection would be avoided for
>>>      non-hw-vlan-accel)
>>
>> Agreed, but moving it inside __netif_receive_skb should have the same
>> effects. If we move non-hw-accel-vlan to a rx_handler, the skb would
>> be COW before being untagged only if shared. This sounds good to me.
>>
>> 	Nicolas.
>

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/net/core/dev.c b/net/core/dev.c
index c71bd18..ec330e1 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3094,12 +3094,12 @@  void netdev_rx_handler_unregister(struct net_device *dev)
 }
 EXPORT_SYMBOL_GPL(netdev_rx_handler_unregister);
 
-static void vlan_on_bond_hook(struct sk_buff *skb)
+static void vlan_on_bond_hook(struct sk_buff *skb, struct net_device *orig_dev)
 {
 	/*
 	 * Make sure ARP frames received on VLAN interfaces stacked on
 	 * bonding interfaces still make their way to any base bonding
-	 * device that may have registered for a specific ptype.
+	 * device by reinjecting the frame into bonding slave (orig_dev)
 	 */
 	if (skb->dev->priv_flags & IFF_802_1Q_VLAN &&
 	    vlan_dev_real_dev(skb->dev)->priv_flags & IFF_BONDING &&
@@ -3108,7 +3108,7 @@  static void vlan_on_bond_hook(struct sk_buff *skb)
 
 		if (!skb2)
 			return;
-		skb2->dev = vlan_dev_real_dev(skb->dev);
+		skb2->dev = orig_dev;
 		netif_rx(skb2);
 	}
 }
@@ -3196,13 +3196,12 @@  ncls:
 			pt_prev = NULL;
 		}
 		if (vlan_hwaccel_do_receive(&skb)) {
-			ret = __netif_receive_skb(skb);
-			goto out;
+			goto another_round;
 		} else if (unlikely(!skb))
 			goto out;
 	}
 
-	vlan_on_bond_hook(skb);
+	vlan_on_bond_hook(skb, orig_dev);
 
 	/* deliver only exact match when indicated */
 	null_or_dev = deliver_exact ? skb->dev : NULL;