diff mbox

[net-next-2.6,V3] net: convert bonding to use rx_handler

Message ID 20110223190541.GB2783@psychotron.redhat.com
State Accepted, archived
Delegated to: David Miller
Headers show

Commit Message

Jiri Pirko Feb. 23, 2011, 7:05 p.m. UTC
This patch converts bonding to use rx_handler. The result is a cleaner
__netif_receive_skb() with far fewer special cases. Bond-specific work is
also moved into the bonding code.

Performance was tested using pktgen and counting incoming packets with
iptables. No regression was noted.

Signed-off-by: Jiri Pirko <jpirko@redhat.com>

v1->v2:
        using skb_iif instead of new input_dev to remember original
	device

v2->v3:
	do another loop in case skb->dev is changed. That way the orig_dev
	handling in the core can be left untouched.

Signed-off-by: Jiri Pirko <jpirko@redhat.com>
---
 drivers/net/bonding/bond_main.c |   74 ++++++++++++++++++++++++-
 net/core/dev.c                  |  119 ++++++++++-----------------------------
 2 files changed, 104 insertions(+), 89 deletions(-)

Comments

Nicolas de Pesloüan Feb. 25, 2011, 11:46 p.m. UTC | #1
On 23/02/2011 20:05, Jiri Pirko wrote:
> This patch converts bonding to use rx_handler. Results in cleaner
> __netif_receive_skb() with much less exceptions needed. Also
> bond-specific work is moved into bond code.
>
> Did performance test using pktgen and counting incoming packets by
> iptables. No regression noted.
>
> Signed-off-by: Jiri Pirko<jpirko@redhat.com>
>
> v1->v2:
>          using skb_iif instead of new input_dev to remember original
> 	device
>
> v2->v3:
> 	do another loop in case skb->dev is changed. That way orig_dev
> 	core can be left untouched.

Hi Jiri,

Finally taking enough time for a review.

I think we should split this change:

1/ Change __netif_receive_skb() to call rx_handler for diverted net_device, until rx_handler is NULL.

2/ Convert currently existing rx_handlers (bridge and macvlan) to use this new "loop" feature, 
removing the need to call netif_rx() inside their respective rx_handler and also removing the 
associated overhead.

3/ Convert bonding to use rx_handlers.

Also, on step 1, we definitely need to clarify what orig_dev should be.

I now think that orig_dev should be "the device one level below the current one", or NULL if the 
current device was not diverted from another one. It means that we should keep an array of crossed 
(diverted) devices and the associated orig_dev. This array would be used to pass the right orig_dev 
to protocol handlers, depending on the device they register on:

eth0 -> bond0 -> br0

A protocol handler registered on bond0 would receive eth0 as orig_dev.
A protocol handler registered on br0 would receive bond0 as orig_dev.

[snip]

> @@ -3167,32 +3135,8 @@ static int __netif_receive_skb(struct sk_buff *skb)

[snip]

> +another_round:
> +
> +	__this_cpu_inc(softnet_data.processed);
> +
>   #ifdef CONFIG_NET_CLS_ACT
>   	if (skb->tc_verd&  TC_NCLS) {
>   		skb->tc_verd = CLR_TC_NCLS(skb->tc_verd);
> @@ -3209,8 +3157,7 @@ static int __netif_receive_skb(struct sk_buff *skb)
>   #endif
>
>   	list_for_each_entry_rcu(ptype,&ptype_all, list) {
> -		if (ptype->dev == null_or_orig || ptype->dev == skb->dev ||
> -		    ptype->dev == orig_dev) {
> +		if (!ptype->dev || ptype->dev == skb->dev) {
>   			if (pt_prev)
>   				ret = deliver_skb(skb, pt_prev, orig_dev);
>   			pt_prev = ptype;
> @@ -3224,16 +3171,20 @@ static int __netif_receive_skb(struct sk_buff *skb)
>   ncls:
>   #endif
>

Why do you loop through ptype_all before calling rx_handler?

I don't understand why ptype_all and ptype_base are not handled at the same place in the current 
__netif_receive_skb(), but I think we should take the opportunity to change that, unless someone 
knows of a good reason not to do so.

> -	/* Handle special case of bridge or macvlan */
>   	rx_handler = rcu_dereference(skb->dev->rx_handler);
>   	if (rx_handler) {

	Nicolas.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jiri Pirko Feb. 26, 2011, 7:14 a.m. UTC | #2
Sat, Feb 26, 2011 at 12:46:53AM CET, nicolas.2p.debian@gmail.com wrote:
>Le 23/02/2011 20:05, Jiri Pirko a écrit :
>>This patch converts bonding to use rx_handler. Results in cleaner
>>__netif_receive_skb() with much less exceptions needed. Also
>>bond-specific work is moved into bond code.
>>
>>Did performance test using pktgen and counting incoming packets by
>>iptables. No regression noted.
>>
>>Signed-off-by: Jiri Pirko<jpirko@redhat.com>
>>
>>v1->v2:
>>         using skb_iif instead of new input_dev to remember original
>>	device
>>
>>v2->v3:
>>	do another loop in case skb->dev is changed. That way orig_dev
>>	core can be left untouched.
>
>Hi Jiri,
>
>Eventually taking enough time for a review.
>
>I think we should split this change :
>
>1/ Change __netif_receive_skb() to call rx_handler for diverted net_device, until rx_handler is NULL.
>
>2/ Convert currently existing rx_handlers (bridge and macvlan) to use
>this new "loop" feature, removing the need to call netif_rx() inside
>their respective rx_handler and also removing the associated
>overhead.

This might not be possible. Macvlan uses the result of the netif_rx() call
for counting, and bridge calls netif_receive_skb() via NF_HOOK. Nevertheless,
this can eventually be handled later, not as part of this patch.
>
>3/ Convert bonding to use rx_handlers.
>
>Also, on step 1, we definitely need to clarify what orig_dev should be.
>
>I now think that orig_dev should be "the device one level below the
>current one" or NULL if current device was not diverted from another
>one. It means that we should keep an array of crossed (diverted)
>devices and the associated orig_dev. This array would be used to pass
>the right orig_dev to protocol handlers, depending on the device they
>register on :

I constructed the patch so that orig_dev is the same in all situations
as before the patch. I think that this decision can be deferred for the
moment.

>
>eth0 -> bond0 -> br0
>
>A protocol handler registered on bond0 would receive eth0 as orig_dev.
>A protocol handler registered on br0 would receive bond0 as orig_dev.
>
>[snip]
>
>>@@ -3167,32 +3135,8 @@ static int __netif_receive_skb(struct sk_buff *skb)
>
>[snip]
>
>>+another_round:
>>+
>>+	__this_cpu_inc(softnet_data.processed);
>>+
>>  #ifdef CONFIG_NET_CLS_ACT
>>  	if (skb->tc_verd&  TC_NCLS) {
>>  		skb->tc_verd = CLR_TC_NCLS(skb->tc_verd);
>>@@ -3209,8 +3157,7 @@ static int __netif_receive_skb(struct sk_buff *skb)
>>  #endif
>>
>>  	list_for_each_entry_rcu(ptype,&ptype_all, list) {
>>-		if (ptype->dev == null_or_orig || ptype->dev == skb->dev ||
>>-		    ptype->dev == orig_dev) {
>>+		if (!ptype->dev || ptype->dev == skb->dev) {
>>  			if (pt_prev)
>>  				ret = deliver_skb(skb, pt_prev, orig_dev);
>>  			pt_prev = ptype;
>>@@ -3224,16 +3171,20 @@ static int __netif_receive_skb(struct sk_buff *skb)
>>  ncls:
>>  #endif
>>
>
>Why do you loop to ptype_all before calling rx_handler ?
>
>I don't understand why ptype_all and ptype_base are not handled at
>the same place in current __netif_receive_skb() but I think we should
>take the opportunity to change that, unless someone know of a good
>reason not to do so.

Again, the patch tries to make as few changes as it can, so this stays
the same as before. If you want to change it, feel free to submit a
follow-on patch.

>
>>-	/* Handle special case of bridge or macvlan */
>>  	rx_handler = rcu_dereference(skb->dev->rx_handler);
>>  	if (rx_handler) {
>
>	Nicolas.
Nicolas de Pesloüan Feb. 26, 2011, 11:25 a.m. UTC | #3
On 26/02/2011 08:14, Jiri Pirko wrote:
> Sat, Feb 26, 2011 at 12:46:53AM CET, nicolas.2p.debian@gmail.com wrote:
>> Le 23/02/2011 20:05, Jiri Pirko a écrit :
>>> This patch converts bonding to use rx_handler. Results in cleaner
>>> __netif_receive_skb() with much less exceptions needed. Also
>>> bond-specific work is moved into bond code.
>>>
>>> Did performance test using pktgen and counting incoming packets by
>>> iptables. No regression noted.
>>>
>>> Signed-off-by: Jiri Pirko<jpirko@redhat.com>
>>>
>>> v1->v2:
>>>          using skb_iif instead of new input_dev to remember original
>>> 	device
>>>
>>> v2->v3:
>>> 	do another loop in case skb->dev is changed. That way orig_dev
>>> 	core can be left untouched.
>>
>> Hi Jiri,
>>
>> Eventually taking enough time for a review.
>>
>> I think we should split this change :
>>
>> 1/ Change __netif_receive_skb() to call rx_handler for diverted net_device, until rx_handler is NULL.
>>
>> 2/ Convert currently existing rx_handlers (bridge and macvlan) to use
>> this new "loop" feature, removing the need to call netif_rx() inside
>> their respective rx_handler and also removing the associated
>> overhead.
>
> This might not be possible. Macvlan uses result of called netif_rx for
> counting, bridge calls netdev_receive_skb via NF_HOOK. Nevertheless,
> this can be eventually handled later, not as a part of this patch.

Yes, I agree. Step 2 and step 3 can be swapped.

Anyway, we need to describe the options given to an rx_handler:

- Return the skb unchanged. This would cause normal delivery (ptype->dev == NULL or ptype->dev == skb->dev).
- Return the skb with skb->dev changed. __netif_receive_skb() will loop to the new device. This would 
cause exact-match delivery only (ptype->dev != NULL and ptype->dev == one of the orig_dev).
- Manage the skb another way and return NULL. This would stop any protocol handler from receiving the 
skb, unless the rx_handler arranges to re-inject the skb somewhere.

>> 3/ Convert bonding to use rx_handlers.
>>
>> Also, on step 1, we definitely need to clarify what orig_dev should be.
>>
>> I now think that orig_dev should be "the device one level below the
>> current one" or NULL if current device was not diverted from another
>> one. It means that we should keep an array of crossed (diverted)
>> devices and the associated orig_dev. This array would be used to pass
>> the right orig_dev to protocol handlers, depending on the device they
>> register on :
>
> I constructed the patch in the way origdev is the same in all situations
> as before the patch. I think that this decision can be ommitted at the
> moment.

Agreed, even if the current handling of orig_dev is far from bulletproof and needs to be clarified 
at some point.

>> eth0 ->  bond0 ->  br0
>>
>> A protocol handler registered on bond0 would receive eth0 as orig_dev.
>> A protocol handler registered on br0 would receive bond0 as orig_dev.
>>
>> [snip]
>>
>>> @@ -3167,32 +3135,8 @@ static int __netif_receive_skb(struct sk_buff *skb)
>>
>> [snip]
>>
>>> +another_round:
>>> +
>>> +	__this_cpu_inc(softnet_data.processed);
>>> +
>>>   #ifdef CONFIG_NET_CLS_ACT
>>>   	if (skb->tc_verd&   TC_NCLS) {
>>>   		skb->tc_verd = CLR_TC_NCLS(skb->tc_verd);
>>> @@ -3209,8 +3157,7 @@ static int __netif_receive_skb(struct sk_buff *skb)
>>>   #endif
>>>
>>>   	list_for_each_entry_rcu(ptype,&ptype_all, list) {
>>> -		if (ptype->dev == null_or_orig || ptype->dev == skb->dev ||
>>> -		    ptype->dev == orig_dev) {
>>> +		if (!ptype->dev || ptype->dev == skb->dev) {
>>>   			if (pt_prev)
>>>   				ret = deliver_skb(skb, pt_prev, orig_dev);
>>>   			pt_prev = ptype;
>>> @@ -3224,16 +3171,20 @@ static int __netif_receive_skb(struct sk_buff *skb)
>>>   ncls:
>>>   #endif
>>>
>>
>> Why do you loop to ptype_all before calling rx_handler ?
>>
>> I don't understand why ptype_all and ptype_base are not handled at
>> the same place in current __netif_receive_skb() but I think we should
>> take the opportunity to change that, unless someone know of a good
>> reason not to do so.
>
> Again, the patch tries to do as little changes as it can. So this stays
> the same as before. In case you want to change it, feel free to submit
> patch doing that as follow-on.

The point here is that bridge and macvlan handling used to be after the ptype_all loop (hence the 
place you inserted the call to rx_handler last summer), but the bonding part is currently before the 
ptype_all loop.

Moving bonding handling after the ptype_all loop will cause the ptype_all loop to be run twice:
- first time, with skb->dev == eth0 and orig_dev == eth0.
- second time, with skb->dev == bond0 and orig_dev == eth0.

The first time currently does not exist. And because bonding hasn't yet been given a chance to 
decide that the frame should be dropped, the packet will always be delivered to eth0, causing 
duplicate deliveries. Note that this is probably true for bridge and macvlan too, and that those 
duplicate deliveries probably already exist.

Also, delivering skb inside a loop that may change the skb (skb->dev at least) is guaranteed to 
produce strange behaviors.

Can someone, knowing the history of ptype_all/ptype_base/bridge/macvlan/bonding/vlan handling in 
__netif_receive_skb(), comment on this?

Are there any reasons not to process ptype_all and ptype_base at the same location, at the end of 
__netif_receive_skb(), and to manage all divert features (bridge/macvlan/bonding/vlan) before?

	Nicolas.
Jiri Pirko Feb. 26, 2011, 2:58 p.m. UTC | #4
Sat, Feb 26, 2011 at 12:25:18PM CET, nicolas.2p.debian@gmail.com wrote:
>Le 26/02/2011 08:14, Jiri Pirko a écrit :
>>Sat, Feb 26, 2011 at 12:46:53AM CET, nicolas.2p.debian@gmail.com wrote:
>>>Le 23/02/2011 20:05, Jiri Pirko a écrit :
>>>>This patch converts bonding to use rx_handler. Results in cleaner
>>>>__netif_receive_skb() with much less exceptions needed. Also
>>>>bond-specific work is moved into bond code.
>>>>
>>>>Did performance test using pktgen and counting incoming packets by
>>>>iptables. No regression noted.
>>>>
>>>>Signed-off-by: Jiri Pirko<jpirko@redhat.com>
>>>>
>>>>v1->v2:
>>>>         using skb_iif instead of new input_dev to remember original
>>>>	device
>>>>
>>>>v2->v3:
>>>>	do another loop in case skb->dev is changed. That way orig_dev
>>>>	core can be left untouched.
>>>
>>>Hi Jiri,
>>>
>>>Eventually taking enough time for a review.
>>>
>>>I think we should split this change :
>>>
>>>1/ Change __netif_receive_skb() to call rx_handler for diverted net_device, until rx_handler is NULL.
>>>
>>>2/ Convert currently existing rx_handlers (bridge and macvlan) to use
>>>this new "loop" feature, removing the need to call netif_rx() inside
>>>their respective rx_handler and also removing the associated
>>>overhead.
>>
>>This might not be possible. Macvlan uses result of called netif_rx for
>>counting, bridge calls netdev_receive_skb via NF_HOOK. Nevertheless,
>>this can be eventually handled later, not as a part of this patch.
>
>Yes, I agree. Step 2 and step 3 can be swapped.
>
>Anyway, we need to describe the options given to a rx_handler:
>
>- Return skb unchanged. This would cause normal delivery (ptype->dev == NULL or ptype->dev == skb->dev).
>- Return skb->dev changed. __netif_receive_skb() will loop to the new
>device. This would cause extact match delivery only (ptype->dev !=
>NULL and ptype->dev == one of the orig_dev).
>- Manage the skb another way and return NULL. This would stop any
>protocol handlers to receive the skb, except if the rx_handler
>arrange to re-inject the skb somewhere.
>
>>>3/ Convert bonding to use rx_handlers.
>>>
>>>Also, on step 1, we definitely need to clarify what orig_dev should be.
>>>
>>>I now think that orig_dev should be "the device one level below the
>>>current one" or NULL if current device was not diverted from another
>>>one. It means that we should keep an array of crossed (diverted)
>>>devices and the associated orig_dev. This array would be used to pass
>>>the right orig_dev to protocol handlers, depending on the device they
>>>register on :
>>
>>I constructed the patch in the way origdev is the same in all situations
>>as before the patch. I think that this decision can be ommitted at the
>>moment.
>
>Agreed, event if the current handling of orig_dev is far from bullet
>proof and needs to be clarified at some time.
>
>>>eth0 ->  bond0 ->  br0
>>>
>>>A protocol handler registered on bond0 would receive eth0 as orig_dev.
>>>A protocol handler registered on br0 would receive bond0 as orig_dev.
>>>
>>>[snip]
>>>
>>>>@@ -3167,32 +3135,8 @@ static int __netif_receive_skb(struct sk_buff *skb)
>>>
>>>[snip]
>>>
>>>>+another_round:
>>>>+
>>>>+	__this_cpu_inc(softnet_data.processed);
>>>>+
>>>>  #ifdef CONFIG_NET_CLS_ACT
>>>>  	if (skb->tc_verd&   TC_NCLS) {
>>>>  		skb->tc_verd = CLR_TC_NCLS(skb->tc_verd);
>>>>@@ -3209,8 +3157,7 @@ static int __netif_receive_skb(struct sk_buff *skb)
>>>>  #endif
>>>>
>>>>  	list_for_each_entry_rcu(ptype,&ptype_all, list) {
>>>>-		if (ptype->dev == null_or_orig || ptype->dev == skb->dev ||
>>>>-		    ptype->dev == orig_dev) {
>>>>+		if (!ptype->dev || ptype->dev == skb->dev) {
>>>>  			if (pt_prev)
>>>>  				ret = deliver_skb(skb, pt_prev, orig_dev);
>>>>  			pt_prev = ptype;
>>>>@@ -3224,16 +3171,20 @@ static int __netif_receive_skb(struct sk_buff *skb)
>>>>  ncls:
>>>>  #endif
>>>>
>>>
>>>Why do you loop to ptype_all before calling rx_handler ?
>>>
>>>I don't understand why ptype_all and ptype_base are not handled at
>>>the same place in current __netif_receive_skb() but I think we should
>>>take the opportunity to change that, unless someone know of a good
>>>reason not to do so.
>>
>>Again, the patch tries to do as little changes as it can. So this stays
>>the same as before. In case you want to change it, feel free to submit
>>patch doing that as follow-on.
>
>The point here is that bridge and macvlan handling used to be after
>the ptype_all loop (hence the place you inserted the call to
>rx_handler last summer), but the bonding part is currently before the
>ptype_all loop.
>
>Moving bonding handling after the ptype_all loop will cause the ptype_all loop to be run twice:
>- first time, with skb->dev == eth0 and orig_dev == eth0.
>- second time, with skb->dev == bond0 and orig_dev == eth0.
>
>The first time currently does not exists. And because bonding wasn't
>given a chance yet to decide that the frame should be dropped, the
>packet will always be delivered to eth0, causing duplicate
>deliveries. Note that this is probably true for bridge and macvlan
>too, and that those duplicate deliveries probably already exists.

Yes, and in fact that is what I like about this patch: deliveries then
become similar to the bridge case.

>
>Also, delivering skb inside a loop that may change the skb (skb->dev
>at least) is guaranteed to produce strange behaviors.
>
>Can someone, knowing the history of
>ptype_all/ptype_base/bridge/macvlan/bonding/vlan handling in
>__netif_receive_skb(), comment on this?
>
>Are there any reasons not to process ptype_all and ptype_base at the
>same location, at the end of __netif_receive_skb(), and to manage all
>divert features (bridge/macvlan/bonding/vlan) before?

That is a very good set of questions. I would like to hear the answers too.

>
>	Nicolas.
Nicolas de Pesloüan Feb. 27, 2011, 2:17 p.m. UTC | #5
On 23/02/2011 20:05, Jiri Pirko wrote:
> This patch converts bonding to use rx_handler. Results in cleaner
> __netif_receive_skb() with much less exceptions needed. Also
> bond-specific work is moved into bond code.
>
> Did performance test using pktgen and counting incoming packets by
> iptables. No regression noted.
>
> Signed-off-by: Jiri Pirko<jpirko@redhat.com>
>
> v1->v2:
>          using skb_iif instead of new input_dev to remember original
> 	device
>
> v2->v3:
> 	do another loop in case skb->dev is changed. That way orig_dev
> 	core can be left untouched.
>
> Signed-off-by: Jiri Pirko<jpirko@redhat.com>
> ---

[snip]

> +static struct sk_buff *bond_handle_frame(struct sk_buff *skb)
> +{
> +	struct net_device *slave_dev;
> +	struct net_device *bond_dev;
> +
> +	skb = skb_share_check(skb, GFP_ATOMIC);
> +	if (unlikely(!skb))
> +		return NULL;
> +	slave_dev = skb->dev;
> +	bond_dev = ACCESS_ONCE(slave_dev->master);
> +	if (unlikely(!bond_dev))
> +		return skb;
> +
> +	if (bond_dev->priv_flags&  IFF_MASTER_ARPMON)
> +		slave_dev->last_rx = jiffies;
> +
> +	if (bond_should_deliver_exact_match(skb, slave_dev, bond_dev)) {
> +		skb->deliver_no_wcard = 1;
> +		return skb;

Shouldn't we return NULL here?

> +	}
> +
> +	skb->dev = bond_dev;
> +
> +	if (bond_dev->priv_flags&  IFF_MASTER_ALB&&
> +	    bond_dev->priv_flags&  IFF_BRIDGE_PORT&&
> +	    skb->pkt_type == PACKET_HOST) {
> +		u16 *dest = (u16 *) eth_hdr(skb)->h_dest;
> +
> +		memcpy(dest, bond_dev->dev_addr, ETH_ALEN);
> +	}
> +
> +	return skb;
> +}
> +

[snip]

> +static void vlan_on_bond_hook(struct sk_buff *skb)
>   {
> -	if (skb->pkt_type == PACKET_HOST) {
> -		u16 *dest = (u16 *) eth_hdr(skb)->h_dest;
> +	/*
> +	 * Make sure ARP frames received on VLAN interfaces stacked on
> +	 * bonding interfaces still make their way to any base bonding
> +	 * device that may have registered for a specific ptype.
> +	 */
> +	if (skb->dev->priv_flags&  IFF_802_1Q_VLAN&&
> +	    vlan_dev_real_dev(skb->dev)->priv_flags&  IFF_BONDING&&
> +	    skb->protocol == htons(ETH_P_ARP)) {

The vlan_on_bond case used to be cost-effective. Now we clone the skb and call netif_rx()...

> +		struct sk_buff *skb2 = skb_clone(skb, GFP_ATOMIC);
>
> -		memcpy(dest, master->dev_addr, ETH_ALEN);
> +		if (!skb2)
> +			return;
> +		skb2->dev = vlan_dev_real_dev(skb->dev);
> +		netif_rx(skb2);
>   	}
>   }

[snip]

>   	if (rx_handler) {
> +		struct net_device *prev_dev;
> +
>   		if (pt_prev) {
>   			ret = deliver_skb(skb, pt_prev, orig_dev);
>   			pt_prev = NULL;
>   		}
> +		prev_dev = skb->dev;
>   		skb = rx_handler(skb);
>   		if (!skb)
>   			goto out;

I would instead consider NULL as meaning exact-match-delivery-only. (The same effect as 
dev_bond_should_drop() returning true).

> +		if (skb->dev != prev_dev)
> +			goto another_round;
>   	}

Anyway, all my comments can be postponed to follow-up patches. Thanks Jiri.

Reviewed-by: Nicolas de Pesloüan <nicolas.2p.debian@free.fr>
Jiri Pirko Feb. 27, 2011, 8:06 p.m. UTC | #6
Sun, Feb 27, 2011 at 03:17:01PM CET, nicolas.2p.debian@gmail.com wrote:
>Le 23/02/2011 20:05, Jiri Pirko a écrit :
>>This patch converts bonding to use rx_handler. Results in cleaner
>>__netif_receive_skb() with much less exceptions needed. Also
>>bond-specific work is moved into bond code.
>>
>>Did performance test using pktgen and counting incoming packets by
>>iptables. No regression noted.
>>
>>Signed-off-by: Jiri Pirko<jpirko@redhat.com>
>>
>>v1->v2:
>>         using skb_iif instead of new input_dev to remember original
>>	device
>>
>>v2->v3:
>>	do another loop in case skb->dev is changed. That way orig_dev
>>	core can be left untouched.
>>
>>Signed-off-by: Jiri Pirko<jpirko@redhat.com>
>>---
>
>[snip]
>
>>+static struct sk_buff *bond_handle_frame(struct sk_buff *skb)
>>+{
>>+	struct net_device *slave_dev;
>>+	struct net_device *bond_dev;
>>+
>>+	skb = skb_share_check(skb, GFP_ATOMIC);
>>+	if (unlikely(!skb))
>>+		return NULL;
>>+	slave_dev = skb->dev;
>>+	bond_dev = ACCESS_ONCE(slave_dev->master);
>>+	if (unlikely(!bond_dev))
>>+		return skb;
>>+
>>+	if (bond_dev->priv_flags&  IFF_MASTER_ARPMON)
>>+		slave_dev->last_rx = jiffies;
>>+
>>+	if (bond_should_deliver_exact_match(skb, slave_dev, bond_dev)) {
>>+		skb->deliver_no_wcard = 1;
>>+		return skb;
>
>Shouldn't we return NULL here ?

No, we shouldn't. We need the skb to be delivered by exact match.

>
>>+	}
>>+
>>+	skb->dev = bond_dev;
>>+
>>+	if (bond_dev->priv_flags&  IFF_MASTER_ALB&&
>>+	    bond_dev->priv_flags&  IFF_BRIDGE_PORT&&
>>+	    skb->pkt_type == PACKET_HOST) {
>>+		u16 *dest = (u16 *) eth_hdr(skb)->h_dest;
>>+
>>+		memcpy(dest, bond_dev->dev_addr, ETH_ALEN);
>>+	}
>>+
>>+	return skb;
>>+}
>>+
>
>[snip]
>
>>+static void vlan_on_bond_hook(struct sk_buff *skb)
>>  {
>>-	if (skb->pkt_type == PACKET_HOST) {
>>-		u16 *dest = (u16 *) eth_hdr(skb)->h_dest;
>>+	/*
>>+	 * Make sure ARP frames received on VLAN interfaces stacked on
>>+	 * bonding interfaces still make their way to any base bonding
>>+	 * device that may have registered for a specific ptype.
>>+	 */
>>+	if (skb->dev->priv_flags&  IFF_802_1Q_VLAN&&
>>+	    vlan_dev_real_dev(skb->dev)->priv_flags&  IFF_BONDING&&
>>+	    skb->protocol == htons(ETH_P_ARP)) {
>
>The vlan_on_bond case used to be cost effective. Now, we clone the skb and call netif_rx...

This should not cost too much overhead considering only a few packets go
through this path. This hook shouldn't have existed in the first place; I
think introducing this functionality was a big mistake.
>
>>+		struct sk_buff *skb2 = skb_clone(skb, GFP_ATOMIC);
>>
>>-		memcpy(dest, master->dev_addr, ETH_ALEN);
>>+		if (!skb2)
>>+			return;
>>+		skb2->dev = vlan_dev_real_dev(skb->dev);
>>+		netif_rx(skb2);
>>  	}
>>  }
>
>[snip]
>
>>  	if (rx_handler) {
>>+		struct net_device *prev_dev;
>>+
>>  		if (pt_prev) {
>>  			ret = deliver_skb(skb, pt_prev, orig_dev);
>>  			pt_prev = NULL;
>>  		}
>>+		prev_dev = skb->dev;
>>  		skb = rx_handler(skb);
>>  		if (!skb)
>>  			goto out;
>
>I would instead consider NULL as meaning exact-match-delivery-only.
>(The same effect as dev_bond_should_drop() returning true).

We can change the behaviour later on.

>
>>+		if (skb->dev != prev_dev)
>>+			goto another_round;
>>  	}
>
>Anyway, all my comments can't be postponed to follow-up patchs. Thanks Jiri.
>
>Reviewed-by: Nicolas de Pesloüan <nicolas.2p.debian@free.fr>
Nicolas de Pesloüan Feb. 27, 2011, 8:59 p.m. UTC | #7
On 27/02/2011 21:06, Jiri Pirko wrote:
> Sun, Feb 27, 2011 at 03:17:01PM CET, nicolas.2p.debian@gmail.com wrote:

>>> +	if (bond_should_deliver_exact_match(skb, slave_dev, bond_dev)) {
>>> +		skb->deliver_no_wcard = 1;
>>> +		return skb;
>>
>> Shouldn't we return NULL here ?
>
> No we shouldn't. We need sbk to be delivered to exact match.

So, if I understand properly:

- If skb->dev changed, loop,
- else, if skb->deliver_no_wcard, do exact-match delivery only,
- else, if !skb, drop the frame, without even exact-match delivery,
- else, do normal delivery.

Right?

>> The vlan_on_bond case used to be cost effective. Now, we clone the skb and call netif_rx...
>
> This should not cost too much overhead considering only few packets are
> going thru this. This hook shouldn't have exited in the fisrt place. I
> think introducing this functionality was a big mistake.

What would you have proposed instead?

Anyway, I think the feature is broken, because it wouldn't provide the expected effect on the 
following configuration:

eth0/eth1 -> bond0 -> br0 -> br0.100.

We probably need a more general way to fix this, after your patch has been accepted.

[snip]

>> I would instead consider NULL as meaning exact-match-delivery-only.
>> (The same effect as dev_bond_should_drop() returning true).
>
> we can change the behaviour later on.

Agreed.

	Nicolas.

Patch

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 77e3c6a..3772b61 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -1423,6 +1423,67 @@  static void bond_setup_by_slave(struct net_device *bond_dev,
 	bond->setup_by_slave = 1;
 }
 
+/* On bonding slaves other than the currently active slave, suppress
+ * duplicates except for 802.3ad ETH_P_SLOW, alb non-mcast/bcast, and
+ * ARP on active-backup slaves with arp_validate enabled.
+ */
+static bool bond_should_deliver_exact_match(struct sk_buff *skb,
+					    struct net_device *slave_dev,
+					    struct net_device *bond_dev)
+{
+	if (slave_dev->priv_flags & IFF_SLAVE_INACTIVE) {
+		if (slave_dev->priv_flags & IFF_SLAVE_NEEDARP &&
+		    skb->protocol == __cpu_to_be16(ETH_P_ARP))
+			return false;
+
+		if (bond_dev->priv_flags & IFF_MASTER_ALB &&
+		    skb->pkt_type != PACKET_BROADCAST &&
+		    skb->pkt_type != PACKET_MULTICAST)
+				return false;
+
+		if (bond_dev->priv_flags & IFF_MASTER_8023AD &&
+		    skb->protocol == __cpu_to_be16(ETH_P_SLOW))
+			return false;
+
+		return true;
+	}
+	return false;
+}
+
+static struct sk_buff *bond_handle_frame(struct sk_buff *skb)
+{
+	struct net_device *slave_dev;
+	struct net_device *bond_dev;
+
+	skb = skb_share_check(skb, GFP_ATOMIC);
+	if (unlikely(!skb))
+		return NULL;
+	slave_dev = skb->dev;
+	bond_dev = ACCESS_ONCE(slave_dev->master);
+	if (unlikely(!bond_dev))
+		return skb;
+
+	if (bond_dev->priv_flags & IFF_MASTER_ARPMON)
+		slave_dev->last_rx = jiffies;
+
+	if (bond_should_deliver_exact_match(skb, slave_dev, bond_dev)) {
+		skb->deliver_no_wcard = 1;
+		return skb;
+	}
+
+	skb->dev = bond_dev;
+
+	if (bond_dev->priv_flags & IFF_MASTER_ALB &&
+	    bond_dev->priv_flags & IFF_BRIDGE_PORT &&
+	    skb->pkt_type == PACKET_HOST) {
+		u16 *dest = (u16 *) eth_hdr(skb)->h_dest;
+
+		memcpy(dest, bond_dev->dev_addr, ETH_ALEN);
+	}
+
+	return skb;
+}
+
 /* enslave device <slave> to bond device <master> */
 int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev)
 {
@@ -1599,11 +1660,17 @@  int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev)
 		pr_debug("Error %d calling netdev_set_bond_master\n", res);
 		goto err_restore_mac;
 	}
+	res = netdev_rx_handler_register(slave_dev, bond_handle_frame, NULL);
+	if (res) {
+		pr_debug("Error %d calling netdev_rx_handler_register\n", res);
+		goto err_unset_master;
+	}
+
 	/* open the slave since the application closed it */
 	res = dev_open(slave_dev);
 	if (res) {
 		pr_debug("Opening slave %s failed\n", slave_dev->name);
-		goto err_unset_master;
+		goto err_unreg_rxhandler;
 	}
 
 	new_slave->dev = slave_dev;
@@ -1811,6 +1878,9 @@  int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev)
 err_close:
 	dev_close(slave_dev);
 
+err_unreg_rxhandler:
+	netdev_rx_handler_unregister(slave_dev);
+
 err_unset_master:
 	netdev_set_bond_master(slave_dev, NULL);
 
@@ -1992,6 +2062,7 @@  int bond_release(struct net_device *bond_dev, struct net_device *slave_dev)
 		netif_addr_unlock_bh(bond_dev);
 	}
 
+	netdev_rx_handler_unregister(slave_dev);
 	netdev_set_bond_master(slave_dev, NULL);
 
 #ifdef CONFIG_NET_POLL_CONTROLLER
@@ -2114,6 +2185,7 @@  static int bond_release_all(struct net_device *bond_dev)
 			netif_addr_unlock_bh(bond_dev);
 		}
 
+		netdev_rx_handler_unregister(slave_dev);
 		netdev_set_bond_master(slave_dev, NULL);
 
 		/* close slave before restoring its mac address */
diff --git a/net/core/dev.c b/net/core/dev.c
index 578415c..e06a834 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3096,63 +3096,31 @@  void netdev_rx_handler_unregister(struct net_device *dev)
 }
 EXPORT_SYMBOL_GPL(netdev_rx_handler_unregister);
 
-static inline void skb_bond_set_mac_by_master(struct sk_buff *skb,
-					      struct net_device *master)
+static void vlan_on_bond_hook(struct sk_buff *skb)
 {
-	if (skb->pkt_type == PACKET_HOST) {
-		u16 *dest = (u16 *) eth_hdr(skb)->h_dest;
+	/*
+	 * Make sure ARP frames received on VLAN interfaces stacked on
+	 * bonding interfaces still make their way to any base bonding
+	 * device that may have registered for a specific ptype.
+	 */
+	if (skb->dev->priv_flags & IFF_802_1Q_VLAN &&
+	    vlan_dev_real_dev(skb->dev)->priv_flags & IFF_BONDING &&
+	    skb->protocol == htons(ETH_P_ARP)) {
+		struct sk_buff *skb2 = skb_clone(skb, GFP_ATOMIC);
 
-		memcpy(dest, master->dev_addr, ETH_ALEN);
+		if (!skb2)
+			return;
+		skb2->dev = vlan_dev_real_dev(skb->dev);
+		netif_rx(skb2);
 	}
 }
 
-/* On bonding slaves other than the currently active slave, suppress
- * duplicates except for 802.3ad ETH_P_SLOW, alb non-mcast/bcast, and
- * ARP on active-backup slaves with arp_validate enabled.
- */
-static int __skb_bond_should_drop(struct sk_buff *skb,
-				  struct net_device *master)
-{
-	struct net_device *dev = skb->dev;
-
-	if (master->priv_flags & IFF_MASTER_ARPMON)
-		dev->last_rx = jiffies;
-
-	if ((master->priv_flags & IFF_MASTER_ALB) &&
-	    (master->priv_flags & IFF_BRIDGE_PORT)) {
-		/* Do address unmangle. The local destination address
-		 * will be always the one master has. Provides the right
-		 * functionality in a bridge.
-		 */
-		skb_bond_set_mac_by_master(skb, master);
-	}
-
-	if (dev->priv_flags & IFF_SLAVE_INACTIVE) {
-		if ((dev->priv_flags & IFF_SLAVE_NEEDARP) &&
-		    skb->protocol == __cpu_to_be16(ETH_P_ARP))
-			return 0;
-
-		if (master->priv_flags & IFF_MASTER_ALB) {
-			if (skb->pkt_type != PACKET_BROADCAST &&
-			    skb->pkt_type != PACKET_MULTICAST)
-				return 0;
-		}
-		if (master->priv_flags & IFF_MASTER_8023AD &&
-		    skb->protocol == __cpu_to_be16(ETH_P_SLOW))
-			return 0;
-
-		return 1;
-	}
-	return 0;
-}
-
 static int __netif_receive_skb(struct sk_buff *skb)
 {
 	struct packet_type *ptype, *pt_prev;
 	rx_handler_func_t *rx_handler;
 	struct net_device *orig_dev;
-	struct net_device *null_or_orig;
-	struct net_device *orig_or_bond;
+	struct net_device *null_or_dev;
 	int ret = NET_RX_DROP;
 	__be16 type;
 
@@ -3167,32 +3135,8 @@  static int __netif_receive_skb(struct sk_buff *skb)
 
 	if (!skb->skb_iif)
 		skb->skb_iif = skb->dev->ifindex;
-
-	/*
-	 * bonding note: skbs received on inactive slaves should only
-	 * be delivered to pkt handlers that are exact matches.  Also
-	 * the deliver_no_wcard flag will be set.  If packet handlers
-	 * are sensitive to duplicate packets these skbs will need to
-	 * be dropped at the handler.
-	 */
-	null_or_orig = NULL;
 	orig_dev = skb->dev;
-	if (skb->deliver_no_wcard)
-		null_or_orig = orig_dev;
-	else if (netif_is_bond_slave(orig_dev)) {
-		struct net_device *bond_master = ACCESS_ONCE(orig_dev->master);
-
-		if (likely(bond_master)) {
-			if (__skb_bond_should_drop(skb, bond_master)) {
-				skb->deliver_no_wcard = 1;
-				/* deliver only exact match */
-				null_or_orig = orig_dev;
-			} else
-				skb->dev = bond_master;
-		}
-	}
 
-	__this_cpu_inc(softnet_data.processed);
 	skb_reset_network_header(skb);
 	skb_reset_transport_header(skb);
 	skb->mac_len = skb->network_header - skb->mac_header;
@@ -3201,6 +3145,10 @@  static int __netif_receive_skb(struct sk_buff *skb)
 
 	rcu_read_lock();
 
+another_round:
+
+	__this_cpu_inc(softnet_data.processed);
+
 #ifdef CONFIG_NET_CLS_ACT
 	if (skb->tc_verd & TC_NCLS) {
 		skb->tc_verd = CLR_TC_NCLS(skb->tc_verd);
@@ -3209,8 +3157,7 @@  static int __netif_receive_skb(struct sk_buff *skb)
 #endif
 
 	list_for_each_entry_rcu(ptype, &ptype_all, list) {
-		if (ptype->dev == null_or_orig || ptype->dev == skb->dev ||
-		    ptype->dev == orig_dev) {
+		if (!ptype->dev || ptype->dev == skb->dev) {
 			if (pt_prev)
 				ret = deliver_skb(skb, pt_prev, orig_dev);
 			pt_prev = ptype;
@@ -3224,16 +3171,20 @@  static int __netif_receive_skb(struct sk_buff *skb)
 ncls:
 #endif
 
-	/* Handle special case of bridge or macvlan */
 	rx_handler = rcu_dereference(skb->dev->rx_handler);
 	if (rx_handler) {
+		struct net_device *prev_dev;
+
 		if (pt_prev) {
 			ret = deliver_skb(skb, pt_prev, orig_dev);
 			pt_prev = NULL;
 		}
+		prev_dev = skb->dev;
 		skb = rx_handler(skb);
 		if (!skb)
 			goto out;
+		if (skb->dev != prev_dev)
+			goto another_round;
 	}
 
 	if (vlan_tx_tag_present(skb)) {
@@ -3248,24 +3199,16 @@  ncls:
 			goto out;
 	}
 
-	/*
-	 * Make sure frames received on VLAN interfaces stacked on
-	 * bonding interfaces still make their way to any base bonding
-	 * device that may have registered for a specific ptype.  The
-	 * handler may have to adjust skb->dev and orig_dev.
-	 */
-	orig_or_bond = orig_dev;
-	if ((skb->dev->priv_flags & IFF_802_1Q_VLAN) &&
-	    (vlan_dev_real_dev(skb->dev)->priv_flags & IFF_BONDING)) {
-		orig_or_bond = vlan_dev_real_dev(skb->dev);
-	}
+	vlan_on_bond_hook(skb);
+
+	/* deliver only exact match when indicated */
+	null_or_dev = skb->deliver_no_wcard ? skb->dev : NULL;
 
 	type = skb->protocol;
 	list_for_each_entry_rcu(ptype,
 			&ptype_base[ntohs(type) & PTYPE_HASH_MASK], list) {
-		if (ptype->type == type && (ptype->dev == null_or_orig ||
-		     ptype->dev == skb->dev || ptype->dev == orig_dev ||
-		     ptype->dev == orig_or_bond)) {
+		if (ptype->type == type &&
+		    (ptype->dev == null_or_dev || ptype->dev == skb->dev)) {
 			if (pt_prev)
 				ret = deliver_skb(skb, pt_prev, orig_dev);
 			pt_prev = ptype;