diff mbox

bonding: reset queue mapping prior to transmission to physical device

Message ID 1307037799-32315-1-git-send-email-nhorman@tuxdriver.com
State Changes Requested, archived
Delegated to: David Miller
Headers show

Commit Message

Neil Horman June 2, 2011, 6:03 p.m. UTC
The bonding driver is multiqueue enabled, in which each queue represents a slave
to enable optional steering of output frames to given slaves against the default
output policy.  However, it needs to reset the skb->queue_mapping prior to
queuing to the physical device or the physical slave (if it is multiqueue) could
wind up transmitting on an unintended tx queue (one that was reserved for
specific traffic classes for instance)

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
CC: Jay Vosburgh <fubar@us.ibm.com>
CC: Andy Gospodarek <andy@greyhouse.net>
CC: "David S. Miller" <davem@davemloft.net>
---
 drivers/net/bonding/bond_main.c |    5 +++++
 1 files changed, 5 insertions(+), 0 deletions(-)

Comments

Ben Hutchings June 2, 2011, 6:35 p.m. UTC | #1
On Thu, 2011-06-02 at 14:03 -0400, Neil Horman wrote:
> The bonding driver is multiqueue enabled, in which each queue represents a slave
> to enable optional steering of output frames to given slaves against the default
> output policy.  However, it needs to reset the skb->queue_mapping prior to
> queuing to the physical device or the physical slave (if it is multiqueue) could
> wind up transmitting on an unintended tx queue (one that was reserved for
> specific traffic classes for instance)
>
> Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
> CC: Jay Vosburgh <fubar@us.ibm.com>
> CC: Andy Gospodarek <andy@greyhouse.net>
> CC: "David S. Miller" <davem@davemloft.net>
> ---
>  drivers/net/bonding/bond_main.c |    5 +++++
>  1 files changed, 5 insertions(+), 0 deletions(-)
> 
> diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
> index 17b4dd9..812c70c 100644
> --- a/drivers/net/bonding/bond_main.c
> +++ b/drivers/net/bonding/bond_main.c
> @@ -4200,6 +4200,11 @@ static inline int bond_slave_override(struct bonding *bond,
>  	/* If the slave isn't UP, use default transmit policy. */
>  	if (slave && slave->queue_id && IS_UP(slave->dev) &&
>  	    (slave->link == BOND_LINK_UP)) {
> +		/*
> + 		 * Reset the queue mapping to allow the underlying slave	
> + 		 * the chance to re-select an appropriate tx queue
> + 		 */
> +		skb_set_queue_mapping(skb, 0);
>  		res = bond_dev_queue_xmit(bond, skb, slave->dev);
>  	}
>  

So far as I can see, this has no effect, because dev_queue_xmit() always
sets queue_mapping (in dev_pick_tx()).

What is the problem you're seeing?

Ben.
Neil Horman June 2, 2011, 6:56 p.m. UTC | #2
On Thu, Jun 02, 2011 at 07:35:53PM +0100, Ben Hutchings wrote:
> On Thu, 2011-06-02 at 14:03 -0400, Neil Horman wrote:
> > The bonding driver is multiqueue enabled, in which each queue represents a slave
> > to enable optional steering of output frames to given slaves against the default
> > output policy.  However, it needs to reset the skb->queue_mapping prior to
> > queuing to the physical device or the physical slave (if it is multiqueue) could
> > wind up transmitting on an unintended tx queue (one that was reserved for
> > specific traffic classes for instance)
> >
> > Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
> > CC: Jay Vosburgh <fubar@us.ibm.com>
> > CC: Andy Gospodarek <andy@greyhouse.net>
> > CC: "David S. Miller" <davem@davemloft.net>
> > ---
> >  drivers/net/bonding/bond_main.c |    5 +++++
> >  1 files changed, 5 insertions(+), 0 deletions(-)
> > 
> > diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
> > index 17b4dd9..812c70c 100644
> > --- a/drivers/net/bonding/bond_main.c
> > +++ b/drivers/net/bonding/bond_main.c
> > @@ -4200,6 +4200,11 @@ static inline int bond_slave_override(struct bonding *bond,
> >  	/* If the slave isn't UP, use default transmit policy. */
> >  	if (slave && slave->queue_id && IS_UP(slave->dev) &&
> >  	    (slave->link == BOND_LINK_UP)) {
> > +		/*
> > + 		 * Reset the queue mapping to allow the underlying slave	
> > + 		 * the chance to re-select an appropriate tx queue
> > + 		 */
> > +		skb_set_queue_mapping(skb, 0);
> >  		res = bond_dev_queue_xmit(bond, skb, slave->dev);
> >  	}
> >  
> 
> So far as I can see, this has no effect, because dev_queue_xmit() always
> sets queue_mapping (in dev_pick_tx()).
> 
it resets the queue mapping exactly as you would expect it to.  bonding is a
multiqueue enabled device and selects a potentially non-zero queue based on the
output of bond_select_queue.

> What is the problem you're seeing?
> 
The problem is exctly that.  dev_pick_tx() on the bond device sets the
queue_mapping as per the result of bond_select_queue (the ndo_select_queue
method for the bonding driver).  The implementation there is based on the use of
tc with bonding, so that output slaves can be selected for certain types of
traffic.  But when that mechanism is used, skb->queue_mapping is preserved when
the bonding driver queues the frame to the underlying slave.  This denies the
slave (if its also a multiqueue device) the opportunity to reselect the queue
properly, because of this call path:

bond_queue_xmit
 dev_queue_xmit(slave_dev)
  dev_pick_tx()
   skb_tx_hash()
    __skb_tx_hash()

__skb_tx_hash sees that skb_queue_recorded returns true, and assigns a hardware queue mapping
based on what the bonding driver chose using its own internal logic.  Since
bonding uses the multiqueue infrastructure to do slave output selection without
any regard for slave output queue selection, it seems to me we should really
reset the queue mapping to zero so the slave device can pick its own tx queue.

Neil

> Ben.
> 
> -- 
> Ben Hutchings, Senior Software Engineer, Solarflare
> Not speaking for my employer; that's the marketing department's job.
> They asked us to note that Solarflare product names are trademarked.
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Ben Hutchings June 2, 2011, 7:09 p.m. UTC | #3
On Thu, 2011-06-02 at 14:56 -0400, Neil Horman wrote:
> On Thu, Jun 02, 2011 at 07:35:53PM +0100, Ben Hutchings wrote:
> > On Thu, 2011-06-02 at 14:03 -0400, Neil Horman wrote:
> > > The bonding driver is multiqueue enabled, in which each queue represents a slave
> > > to enable optional steering of output frames to given slaves against the default
> > > output policy.  However, it needs to reset the skb->queue_mapping prior to
> > > queuing to the physical device or the physical slave (if it is multiqueue) could
> > > wind up transmitting on an unintended tx queue (one that was reserved for
> > > specific traffic classes for instance)
[...]
> > So far as I can see, this has no effect, because dev_queue_xmit() always
> > sets queue_mapping (in dev_pick_tx()).
> > 
> it resets the queue mapping exactly as you would expect it to.  bonding is a
> multiqueue enabled device and selects a potentially non-zero queue based on the
> output of bond_select_queue.
> 
> > What is the problem you're seeing?
> > 
> The problem is exctly that.  dev_pick_tx() on the bond device sets the
> queue_mapping as per the result of bond_select_queue (the ndo_select_queue
> method for the bonding driver).  The implementation there is based on the use of
> tc with bonding, so that output slaves can be selected for certain types of
> traffic.  But when that mechanism is used, skb->queue_mapping is preserved when
> the bonding driver queues the frame to the underlying slave.  This denies the
> slave (if its also a multiqueue device) the opportunity to reselect the queue
> properly, because of this call path:
> 
> bond_queue_xmit
>  dev_queue_xmit(slave_dev)
>   dev_pick_tx()
>    skb_tx_hash()
>     __skb_tx_hash()
> 
> __skb_tx_hash sees that skb_queue_recorded returns true, and assigns a hardware queue mapping
> based on what the bonding driver chose using its own internal logic.  Since
> bonding uses the multiqueue infrastructure to do slave output selection without
> any regard for slave output queue selection, it seems to me we should really
> reset the queue mapping to zero so the slave device can pick its own tx queue.

So you're effectively clearing the *RX queue* number (as this is before
dev_pick_tx()) in order to influence TX queue selection.

Here, the bonding device seems to be behaving as a forwarding device.
If TX queue selection can go wrong for certain combinations of queue
configuration when forwarding, then this is a problem for IP forwarding
and bridging as well, isn't it?

Ben.
Neil Horman June 2, 2011, 7:46 p.m. UTC | #4
On Thu, Jun 02, 2011 at 08:09:49PM +0100, Ben Hutchings wrote:
> On Thu, 2011-06-02 at 14:56 -0400, Neil Horman wrote:
> > On Thu, Jun 02, 2011 at 07:35:53PM +0100, Ben Hutchings wrote:
> > > On Thu, 2011-06-02 at 14:03 -0400, Neil Horman wrote:
> > > > The bonding driver is multiqueue enabled, in which each queue represents a slave
> > > > to enable optional steering of output frames to given slaves against the default
> > > > output policy.  However, it needs to reset the skb->queue_mapping prior to
> > > > queuing to the physical device or the physical slave (if it is multiqueue) could
> > > > wind up transmitting on an unintended tx queue (one that was reserved for
> > > > specific traffic classes for instance)
> [...]
> > > So far as I can see, this has no effect, because dev_queue_xmit() always
> > > sets queue_mapping (in dev_pick_tx()).
> > > 
> > it resets the queue mapping exactly as you would expect it to.  bonding is a
> > multiqueue enabled device and selects a potentially non-zero queue based on the
> > output of bond_select_queue.
> > 
> > > What is the problem you're seeing?
> > > 
> > The problem is exctly that.  dev_pick_tx() on the bond device sets the
> > queue_mapping as per the result of bond_select_queue (the ndo_select_queue
> > method for the bonding driver).  The implementation there is based on the use of
> > tc with bonding, so that output slaves can be selected for certain types of
> > traffic.  But when that mechanism is used, skb->queue_mapping is preserved when
> > the bonding driver queues the frame to the underlying slave.  This denies the
> > slave (if its also a multiqueue device) the opportunity to reselect the queue
> > properly, because of this call path:
> > 
> > bond_queue_xmit
> >  dev_queue_xmit(slave_dev)
> >   dev_pick_tx()
> >    skb_tx_hash()
> >     __skb_tx_hash()
> > 
> > __skb_tx_hash sees that skb_queue_recorded returns true, and assigns a hardware queue mapping
> > based on what the bonding driver chose using its own internal logic.  Since
> > bonding uses the multiqueue infrastructure to do slave output selection without
> > any regard for slave output queue selection, it seems to me we should really
> > reset the queue mapping to zero so the slave device can pick its own tx queue.
> 
> So you're effectively clearing the *RX queue* number (as this is before
> dev_pick_tx()) in order to influence TX queue selection.
> 
No, you're not seeing it properly.  In bonding (as with all stacked devices) we
make two passes through dev_pick_tx.

1) The first time we call dev_pick_tx is when the network stack calls
dev_queue_xmit on the bond device.  here we call ndo_select_queue, which calls
bond_select_queue.  This method tells the stack which queue to enqueue the skb
on for the bond device.  We can use tc's skbedit action to select a particular
queue on the bond device for various classes of traffic, and those queues
correspond to individual slave devices.

2) the second time we call dev_pick_tx is when the bonding bond_queue_xmit
routine (which is the bonding drivers ndo_start_xmit method, called after the
frist dev_pick_tx), calls dev_queue_xmit, passing the slave net_device pointer).
In this case we do whatever the slave device has configured (either reset the
queue to zero, or call the slaves ndo_select queue method).

What I'm fixing is the fact that the bonding drivers queue_mappnig value is
'leaking' down to the slave.

Lets say, for example we're bonding two multiqueue devices, and have the bonding
driver configured to send all traffic to 192.168.1.1 over the first slave (which
we can accomplish using an appropriate tc rule on the bond device, setting
frames with that dest ip to have a skb->queue_mapping of 1).

In the above example, frames bound for 192.168.1.1 have skb->queue_mapping set
to 1 when they enter bond_queue_xmit.  bond_queue_xmit, calls
dev_queue_xmit(slave, skb), which calls dev_pick_tx(slave, skb).  If we (for
simplicity's sake) assume the slave has no ndo_select_queue method, dev_pick_tx
calls __skb_tx_hash to determine the output queue.  But the bonding driver
already set skb->queue_mapping to 1 (because it wanted this frame output on the
first slave, not because it wanted to transmit the frame on the slaves tx queue
1).  Nevertheless, __skb_tx_hash() sees that skb_rx_queue_recorded returns true
here and as a result, calls skb_rx_get_queue, which returns 0.  Because of that
we wind up transmitting on hardware queue 0 all the time, which is not at all
what we want.  What we want is for the bonding driver to be able to use
queue_mapping for its own purposes, and then allow the physical device to make
its own output queue selection independently.  To do this, queue_mapping needs
to be zero, prior to queuenig the skb to the slave, which is exactly what this
patch does.

> Here, the bonding device seems to be behaving as a forwarding device.
> If TX queue selection can go wrong for certain combinations of queue
> configuration when forwarding, then this is a problem for IP forwarding
> and bridging as well, isn't it?
> 
Potentially, yes.  I only fixed this because I was looking at bonding and its
queue_mapping behavior, and saw that this needed fixing.  Bridging and IP
forwarding should also likely clear the queue mapping in the forwarding path
somewhere to avoid selecting an output tx queue that is a function of whatever
queue and device it arrived on during ingress.  I've not yet looked to see if
thats already being done.

Neil

> Ben.
> 
> -- 
> Ben Hutchings, Senior Software Engineer, Solarflare
> Not speaking for my employer; that's the marketing department's job.
> They asked us to note that Solarflare product names are trademarked.
> 
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Nicolas de Pesloüan June 2, 2011, 7:52 p.m. UTC | #5
Le 02/06/2011 21:46, Neil Horman a écrit :
> On Thu, Jun 02, 2011 at 08:09:49PM +0100, Ben Hutchings wrote:
>> On Thu, 2011-06-02 at 14:56 -0400, Neil Horman wrote:
>>> On Thu, Jun 02, 2011 at 07:35:53PM +0100, Ben Hutchings wrote:
>>>> On Thu, 2011-06-02 at 14:03 -0400, Neil Horman wrote:
>>>>> The bonding driver is multiqueue enabled, in which each queue represents a slave
>>>>> to enable optional steering of output frames to given slaves against the default
>>>>> output policy.  However, it needs to reset the skb->queue_mapping prior to
>>>>> queuing to the physical device or the physical slave (if it is multiqueue) could
>>>>> wind up transmitting on an unintended tx queue (one that was reserved for
>>>>> specific traffic classes for instance)
>> [...]
>>>> So far as I can see, this has no effect, because dev_queue_xmit() always
>>>> sets queue_mapping (in dev_pick_tx()).
>>>>
>>> it resets the queue mapping exactly as you would expect it to.  bonding is a
>>> multiqueue enabled device and selects a potentially non-zero queue based on the
>>> output of bond_select_queue.
>>>
>>>> What is the problem you're seeing?
>>>>
>>> The problem is exctly that.  dev_pick_tx() on the bond device sets the
>>> queue_mapping as per the result of bond_select_queue (the ndo_select_queue
>>> method for the bonding driver).  The implementation there is based on the use of
>>> tc with bonding, so that output slaves can be selected for certain types of
>>> traffic.  But when that mechanism is used, skb->queue_mapping is preserved when
>>> the bonding driver queues the frame to the underlying slave.  This denies the
>>> slave (if its also a multiqueue device) the opportunity to reselect the queue
>>> properly, because of this call path:
>>>
>>> bond_queue_xmit
>>>   dev_queue_xmit(slave_dev)
>>>    dev_pick_tx()
>>>     skb_tx_hash()
>>>      __skb_tx_hash()
>>>
>>> __skb_tx_hash sees that skb_queue_recorded returns true, and assigns a hardware queue mapping
>>> based on what the bonding driver chose using its own internal logic.  Since
>>> bonding uses the multiqueue infrastructure to do slave output selection without
>>> any regard for slave output queue selection, it seems to me we should really
>>> reset the queue mapping to zero so the slave device can pick its own tx queue.
>>
>> So you're effectively clearing the *RX queue* number (as this is before
>> dev_pick_tx()) in order to influence TX queue selection.
>>
> No, you're not seeing it properly.  In bonding (as with all stacked devices) we
> make two passes through dev_pick_tx.
>
> 1) The first time we call dev_pick_tx is when the network stack calls
> dev_queue_xmit on the bond device.  here we call ndo_select_queue, which calls
> bond_select_queue.  This method tells the stack which queue to enqueue the skb
> on for the bond device.  We can use tc's skbedit action to select a particular
> queue on the bond device for various classes of traffic, and those queues
> correspond to individual slave devices.
>
> 2) the second time we call dev_pick_tx is when the bonding bond_queue_xmit
> routine (which is the bonding drivers ndo_start_xmit method, called after the
> frist dev_pick_tx), calls dev_queue_xmit, passing the slave net_device pointer).
> In this case we do whatever the slave device has configured (either reset the
> queue to zero, or call the slaves ndo_select queue method).
>
> What I'm fixing is the fact that the bonding drivers queue_mappnig value is
> 'leaking' down to the slave.
>
> Lets say, for example we're bonding two multiqueue devices, and have the bonding
> driver configured to send all traffic to 192.168.1.1 over the first slave (which
> we can accomplish using an appropriate tc rule on the bond device, setting
> frames with that dest ip to have a skb->queue_mapping of 1).
>
> In the above example, frames bound for 192.168.1.1 have skb->queue_mapping set
> to 1 when they enter bond_queue_xmit.  bond_queue_xmit, calls
> dev_queue_xmit(slave, skb), which calls dev_pick_tx(slave, skb).  If we (for
> simplicity's sake) assume the slave has no ndo_select_queue method, dev_pick_tx
> calls __skb_tx_hash to determine the output queue.  But the bonding driver
> already set skb->queue_mapping to 1 (because it wanted this frame output on the
> first slave, not because it wanted to transmit the frame on the slaves tx queue
> 1).  Nevertheless, __skb_tx_hash() sees that skb_rx_queue_recorded returns true
> here and as a result, calls skb_rx_get_queue, which returns 0.  Because of that
> we wind up transmitting on hardware queue 0 all the time, which is not at all
> what we want.  What we want is for the bonding driver to be able to use
> queue_mapping for its own purposes, and then allow the physical device to make
> its own output queue selection independently.  To do this, queue_mapping needs
> to be zero, prior to queuenig the skb to the slave, which is exactly what this
> patch does.

This sounds good to me.

Reviewed-by: Nicolas de Pesloüan <nicolas.2p.debian@free.fr>

>
>> Here, the bonding device seems to be behaving as a forwarding device.
>> If TX queue selection can go wrong for certain combinations of queue
>> configuration when forwarding, then this is a problem for IP forwarding
>> and bridging as well, isn't it?
>>
> Potentially, yes.  I only fixed this because I was looking at bonding and its
> queue_mapping behavior, and saw that this needed fixing.  Bridging and IP
> forwarding should also likely clear the queue mapping in the forwarding path
> somewhere to avoid selecting an output tx queue that is a function of whatever
> queue and device it arrived on during ingress.  I've not yet looked to see if
> thats already being done.
>
> Neil
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller June 2, 2011, 7:59 p.m. UTC | #6
From: Ben Hutchings <bhutchings@solarflare.com>
Date: Thu, 02 Jun 2011 19:35:53 +0100

> So far as I can see, this has no effect, because dev_queue_xmit() always
> sets queue_mapping (in dev_pick_tx()).

__skb_tx_hash() uses any hash value recorded in the SKB before trying
to manually it.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller June 2, 2011, 8:04 p.m. UTC | #7
From: Neil Horman <nhorman@tuxdriver.com>
Date: Thu, 2 Jun 2011 15:46:21 -0400

> Potentially, yes.  I only fixed this because I was looking at bonding and its
> queue_mapping behavior, and saw that this needed fixing.  Bridging and IP
> forwarding should also likely clear the queue mapping in the forwarding path
> somewhere to avoid selecting an output tx queue that is a function of whatever
> queue and device it arrived on during ingress.  I've not yet looked to see if
> thats already being done.

No we do not do this, intentionally.

That way the input classification and queue selection is mirrored
on the transmit side, which we absolutely want to happen.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller June 2, 2011, 8:07 p.m. UTC | #8
From: Neil Horman <nhorman@tuxdriver.com>
Date: Thu,  2 Jun 2011 14:03:19 -0400

> The bonding driver is multiqueue enabled, in which each queue represents a slave
> to enable optional steering of output frames to given slaves against the default
> output policy.  However, it needs to reset the skb->queue_mapping prior to
> queuing to the physical device or the physical slave (if it is multiqueue) could
> wind up transmitting on an unintended tx queue (one that was reserved for
> specific traffic classes for instance)
> 
> Signed-off-by: Neil Horman <nhorman@tuxdriver.com>

Since, as I mentioned, the idea when we are forwarding and bridging is that
we use the input receive classification to influence the spread on transmit,
I think things like this bonding case should remember the rxhash setting
before they override it and then restore that value right before invoking
dev_queue_xmit().
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Ben Hutchings June 2, 2011, 8:13 p.m. UTC | #9
On Thu, 2011-06-02 at 15:46 -0400, Neil Horman wrote:
> On Thu, Jun 02, 2011 at 08:09:49PM +0100, Ben Hutchings wrote:
> > On Thu, 2011-06-02 at 14:56 -0400, Neil Horman wrote:
> > > On Thu, Jun 02, 2011 at 07:35:53PM +0100, Ben Hutchings wrote:
> > > > On Thu, 2011-06-02 at 14:03 -0400, Neil Horman wrote:
> > > > > The bonding driver is multiqueue enabled, in which each queue represents a slave
> > > > > to enable optional steering of output frames to given slaves against the default
> > > > > output policy.  However, it needs to reset the skb->queue_mapping prior to
> > > > > queuing to the physical device or the physical slave (if it is multiqueue) could
> > > > > wind up transmitting on an unintended tx queue (one that was reserved for
> > > > > specific traffic classes for instance)
> > [...]
> > > > So far as I can see, this has no effect, because dev_queue_xmit() always
> > > > sets queue_mapping (in dev_pick_tx()).
> > > > 
> > > it resets the queue mapping exactly as you would expect it to.  bonding is a
> > > multiqueue enabled device and selects a potentially non-zero queue based on the
> > > output of bond_select_queue.
> > > 
> > > > What is the problem you're seeing?
> > > > 
> > > The problem is exctly that.  dev_pick_tx() on the bond device sets the
> > > queue_mapping as per the result of bond_select_queue (the ndo_select_queue
> > > method for the bonding driver).  The implementation there is based on the use of
> > > tc with bonding, so that output slaves can be selected for certain types of
> > > traffic.  But when that mechanism is used, skb->queue_mapping is preserved when
> > > the bonding driver queues the frame to the underlying slave.  This denies the
> > > slave (if its also a multiqueue device) the opportunity to reselect the queue
> > > properly, because of this call path:
> > > 
> > > bond_queue_xmit
> > >  dev_queue_xmit(slave_dev)
> > >   dev_pick_tx()
> > >    skb_tx_hash()
> > >     __skb_tx_hash()
> > > 
> > > __skb_tx_hash sees that skb_queue_recorded returns true, and assigns a hardware queue mapping
> > > based on what the bonding driver chose using its own internal logic.  Since
> > > bonding uses the multiqueue infrastructure to do slave output selection without
> > > any regard for slave output queue selection, it seems to me we should really
> > > reset the queue mapping to zero so the slave device can pick its own tx queue.
> > 
> > So you're effectively clearing the *RX queue* number (as this is before
> > dev_pick_tx()) in order to influence TX queue selection.
> > 
> No, you're not seeing it properly.  In bonding (as with all stacked devices) we
> make two passes through dev_pick_tx.

Only if the stacked device has its own software queues and schedulers;
e.g. VLAN devices don't.

> 1) The first time we call dev_pick_tx is when the network stack calls
> dev_queue_xmit on the bond device.  here we call ndo_select_queue, which calls
> bond_select_queue.  This method tells the stack which queue to enqueue the skb
> on for the bond device.  We can use tc's skbedit action to select a particular
> queue on the bond device for various classes of traffic, and those queues
> correspond to individual slave devices.
> 
> 2) the second time we call dev_pick_tx is when the bonding bond_queue_xmit
> routine (which is the bonding drivers ndo_start_xmit method, called after the
> frist dev_pick_tx), calls dev_queue_xmit, passing the slave net_device pointer).
> In this case we do whatever the slave device has configured (either reset the
> queue to zero, or call the slaves ndo_select queue method).
> 
> What I'm fixing is the fact that the bonding drivers queue_mappnig value is
> 'leaking' down to the slave.

And this is because dev_queue_xmit() assumes that it's a record of the
RX queue number.

The differing interpretation of queue_mapping for RX and TX is annoying
and I think we should change the initial value of queue_mapping to -1.
But that's a separate issue.

> Lets say, for example we're bonding two multiqueue devices, and have the bonding
> driver configured to send all traffic to 192.168.1.1 over the first slave (which
> we can accomplish using an appropriate tc rule on the bond device, setting
> frames with that dest ip to have a skb->queue_mapping of 1).
> 
> In the above example, frames bound for 192.168.1.1 have skb->queue_mapping set
> to 1 when they enter bond_queue_xmit.  bond_queue_xmit, calls
> dev_queue_xmit(slave, skb), which calls dev_pick_tx(slave, skb).  If we (for
> simplicity's sake) assume the slave has no ndo_select_queue method, dev_pick_tx
> calls __skb_tx_hash to determine the output queue.  But the bonding driver
> already set skb->queue_mapping to 1 (because it wanted this frame output on the
> first slave, not because it wanted to transmit the frame on the slaves tx queue
> 1).  Nevertheless, __skb_tx_hash() sees that skb_rx_queue_recorded returns true
> here and as a result, calls skb_rx_get_queue, which returns 0.  Because of that
> we wind up transmitting on hardware queue 0 all the time, which is not at all
> what we want.  What we want is for the bonding driver to be able to use
> queue_mapping for its own purposes, and then allow the physical device to make
> its own output queue selection independently.
[...]

I think I understand - the bonding device is effectively single-queue
and shouldn't record an RX queue number.  But I think you should define
and use skb_clear_rx_queue() to set queue_mapping=0, rather than abusing
skb_set_queue_mapping() which is meant to take a TX queue number.

Ben.
Nicolas de Pesloüan June 2, 2011, 8:13 p.m. UTC | #10
Le 02/06/2011 22:04, David Miller a écrit :
> From: Neil Horman<nhorman@tuxdriver.com>
> Date: Thu, 2 Jun 2011 15:46:21 -0400
>
>> Potentially, yes.  I only fixed this because I was looking at bonding and its
>> queue_mapping behavior, and saw that this needed fixing.  Bridging and IP
>> forwarding should also likely clear the queue mapping in the forwarding path
>> somewhere to avoid selecting an output tx queue that is a function of whatever
>> queue and device it arrived on during ingress.  I've not yet looked to see if
>> thats already being done.
>
> No we do not do this, intentionally.
>
> That way the input classification and queue selection is mirrored
> on the transmit side, which we absolutely want to happen.

Can you confirm that this is the expected behavior for IP forwarding and bridge but not for bonding?

To be more precise, due to the way bonding use queue mapping for slave selection, it is desirable to 
clear the mapping before sending to the slave, because the meaning of the mapping for the slave 
interface might be really different from the meaning for the bonding interface. Arguably, this is 
the mapping usage in bonding which is "different" from other usages, but...

	Nicolas.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Nicolas de Pesloüan June 2, 2011, 8:22 p.m. UTC | #11
Le 02/06/2011 22:07, David Miller a écrit :
> From: Neil Horman<nhorman@tuxdriver.com>
> Date: Thu,  2 Jun 2011 14:03:19 -0400
>
>> The bonding driver is multiqueue enabled, in which each queue represents a slave
>> to enable optional steering of output frames to given slaves against the default
>> output policy.  However, it needs to reset the skb->queue_mapping prior to
>> queuing to the physical device or the physical slave (if it is multiqueue) could
>> wind up transmitting on an unintended tx queue (one that was reserved for
>> specific traffic classes for instance)
>>
>> Signed-off-by: Neil Horman<nhorman@tuxdriver.com>
>
> Since, as I mentioned, the idea when we are forwarding and bridging is that
> we use the input receive classification to influence the spread on transmit,
> I think things like this bonding case should remember the rxhash setting
> before they override it and then restore that value right before invoking
> dev_queue_xmit().

Ok, now I understand. Maybe, using queue mapping for special slave selection wasn't such a good idea 
at the very beginning, because it pollutes the RX mapping that is expected to be propagated up to 
TX. Restoring the original value before invoking dev_queue_xmit() would fix this, but I'm not sure 
it is the cleanest way to do it.

	Nicolas.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller June 2, 2011, 8:46 p.m. UTC | #12
From: Nicolas de Pesloüan <nicolas.2p.debian@gmail.com>
Date: Thu, 02 Jun 2011 22:13:57 +0200

> To be more precise, due to the way bonding use queue mapping for slave
> selection, it is desirable to clear the mapping before sending to the
> slave, because the meaning of the mapping for the slave interface
> might be really different from the meaning for the bonding
> interface. Arguably, this is the mapping usage in bonding which is
> "different" from other usages, but...

This just confirms my reasoning behind why I wanted to discourage
drivers from providing explicit ->ndo_select_queue() methods unless
absolutely necessary.

Information now gets lost in cases like this bonding issue.

Bonding should definitely, as I suggested, remember the original
rxhash value and restore it when sending to the slave.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Ben Hutchings June 2, 2011, 8:51 p.m. UTC | #13
On Thu, 2011-06-02 at 13:46 -0700, David Miller wrote:
> From: Nicolas de Pesloüan <nicolas.2p.debian@gmail.com>
> Date: Thu, 02 Jun 2011 22:13:57 +0200
> 
> > To be more precise, due to the way bonding use queue mapping for slave
> > selection, it is desirable to clear the mapping before sending to the
> > slave, because the meaning of the mapping for the slave interface
> > might be really different from the meaning for the bonding
> > interface. Arguably, this is the mapping usage in bonding which is
> > "different" from other usages, but...
> 
> This just confirms my reasoning behind why I wanted to discourage
> drivers from providing explicit ->ndo_select_queue() methods unless
> absolutely necessary.
> 
> Information now gets lost in cases like this bonding issue.
> 
> Bonding should definitely, as I suggested, remember the original
> rxhash value and restore it when sending to the slave.

Surely RX queue (queue_mapping), not RX hash (rxhash, which is unchanged
anyway AFAIK).

Ben.
David Miller June 2, 2011, 9:10 p.m. UTC | #14
From: Ben Hutchings <bhutchings@solarflare.com>
Date: Thu, 02 Jun 2011 21:51:46 +0100

> On Thu, 2011-06-02 at 13:46 -0700, David Miller wrote:
>> From: Nicolas de Pesloüan <nicolas.2p.debian@gmail.com>
>> Date: Thu, 02 Jun 2011 22:13:57 +0200
>> 
>> > To be more precise, due to the way bonding use queue mapping for slave
>> > selection, it is desirable to clear the mapping before sending to the
>> > slave, because the meaning of the mapping for the slave interface
>> > might be really different from the meaning for the bonding
>> > interface. Arguably, this is the mapping usage in bonding which is
>> > "different" from other usages, but...
>> 
>> This just confirms my reasoning behind why I wanted to discourage
>> drivers from providing explicit ->ndo_select_queue() methods unless
>> absolutely necessary.
>> 
>> Information now gets lost in cases like this bonding issue.
>> 
>> Bonding should definitely, as I suggested, remember the original
>> rxhash value and restore it when sending to the slave.
> 
> Surely RX queue (queue_mapping), not RX hash (rxhash, which is unchanged
> anyway AFAIK).

Right.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Neil Horman June 3, 2011, 1:04 a.m. UTC | #15
On Thu, Jun 02, 2011 at 01:07:10PM -0700, David Miller wrote:
> From: Neil Horman <nhorman@tuxdriver.com>
> Date: Thu,  2 Jun 2011 14:03:19 -0400
> 
> > The bonding driver is multiqueue enabled, in which each queue represents a slave
> > to enable optional steering of output frames to given slaves against the default
> > output policy.  However, it needs to reset the skb->queue_mapping prior to
> > queuing to the physical device or the physical slave (if it is multiqueue) could
> > wind up transmitting on an unintended tx queue (one that was reserved for
> > specific traffic classes for instance)
> > 
> > Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
> 
> Since, as I mentioned, the idea when we are forwarding and bridging is that
> we use the input receive classification to influence the spread on transmit,
> I think things like this bonding case should remember the rxhash setting
> before they override it and then restore that value right before invoking
> dev_queue_xmit().
> 
Ok, I can respin the patch to do that.  I'll handle it in the AM.

Thanks
Neil

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Neil Horman June 3, 2011, 1:16 a.m. UTC | #16
On Thu, Jun 02, 2011 at 09:13:24PM +0100, Ben Hutchings wrote:
> On Thu, 2011-06-02 at 15:46 -0400, Neil Horman wrote:
> > On Thu, Jun 02, 2011 at 08:09:49PM +0100, Ben Hutchings wrote:
> > > On Thu, 2011-06-02 at 14:56 -0400, Neil Horman wrote:
> > > > On Thu, Jun 02, 2011 at 07:35:53PM +0100, Ben Hutchings wrote:
> > > > > On Thu, 2011-06-02 at 14:03 -0400, Neil Horman wrote:
> > > > > > The bonding driver is multiqueue enabled, in which each queue represents a slave
> > > > > > to enable optional steering of output frames to given slaves against the default
> > > > > > output policy.  However, it needs to reset the skb->queue_mapping prior to
> > > > > > queuing to the physical device or the physical slave (if it is multiqueue) could
> > > > > > wind up transmitting on an unintended tx queue (one that was reserved for
> > > > > > specific traffic classes for instance)
> > > [...]
> > > > > So far as I can see, this has no effect, because dev_queue_xmit() always
> > > > > sets queue_mapping (in dev_pick_tx()).
> > > > > 
> > > > it resets the queue mapping exactly as you would expect it to.  bonding is a
> > > > multiqueue enabled device and selects a potentially non-zero queue based on the
> > > > output of bond_select_queue.
> > > > 
> > > > > What is the problem you're seeing?
> > > > > 
> > > > The problem is exctly that.  dev_pick_tx() on the bond device sets the
> > > > queue_mapping as per the result of bond_select_queue (the ndo_select_queue
> > > > method for the bonding driver).  The implementation there is based on the use of
> > > > tc with bonding, so that output slaves can be selected for certain types of
> > > > traffic.  But when that mechanism is used, skb->queue_mapping is preserved when
> > > > the bonding driver queues the frame to the underlying slave.  This denies the
> > > > slave (if its also a multiqueue device) the opportunity to reselect the queue
> > > > properly, because of this call path:
> > > > 
> > > > bond_queue_xmit
> > > >  dev_queue_xmit(slave_dev)
> > > >   dev_pick_tx()
> > > >    skb_tx_hash()
> > > >     __skb_tx_hash()
> > > > 
> > > > __skb_tx_hash sees that skb_queue_recorded returns true, and assigns a hardware queue mapping
> > > > based on what the bonding driver chose using its own internal logic.  Since
> > > > bonding uses the multiqueue infrastructure to do slave output selection without
> > > > any regard for slave output queue selection, it seems to me we should really
> > > > reset the queue mapping to zero so the slave device can pick its own tx queue.
> > > 
> > > So you're effectively clearing the *RX queue* number (as this is before
> > > dev_pick_tx()) in order to influence TX queue selection.
> > > 
> > No, you're not seeing it properly.  In bonding (as with all stacked devices) we
> > make two passes through dev_pick_tx.
> 
> Only if the stacked device has its own software queues and schedulers;
> e.g. VLAN devices don't.
> 
Every net_device has a default fifo qdisc, and as such, each net_device when
called with dev_queue_xmit calls dev_pick_tx, and that includes vlans.  It just
so happens that vlans are a single queue device and come out of dev_pick_tx with
a queue_mapping of 0.  Thats all moot of course, since we're talking about
bonding.

> > 1) The first time we call dev_pick_tx is when the network stack calls
> > dev_queue_xmit on the bond device.  here we call ndo_select_queue, which calls
> > bond_select_queue.  This method tells the stack which queue to enqueue the skb
> > on for the bond device.  We can use tc's skbedit action to select a particular
> > queue on the bond device for various classes of traffic, and those queues
> > correspond to individual slave devices.
> > 
> > 2) the second time we call dev_pick_tx is when the bonding bond_queue_xmit
> > routine (which is the bonding drivers ndo_start_xmit method, called after the
> > frist dev_pick_tx), calls dev_queue_xmit, passing the slave net_device pointer).
> > In this case we do whatever the slave device has configured (either reset the
> > queue to zero, or call the slaves ndo_select queue method).
> > 
> > What I'm fixing is the fact that the bonding drivers queue_mappnig value is
> > 'leaking' down to the slave.
> 
> And this is because dev_queue_xmit() assumes that it's a record of the
> RX queue number.
> 
Yes, correct.  stacked devices provide the opportuninty for queue_mapping to no
longer be that rx queue number, but rather a tx queue mapping.  This is a
potentential problem with all stacked devices, but bonding is the only thing
showing it right now because its the only stacked device that can set a non-zero
queue_mapping value.

> The differing interpretation of queue_mapping for RX and TX is annoying
> and I think we should change the initial value of queue_mapping to -1.
> But that's a separate issue.
> 
Agreed.

> > Lets say, for example we're bonding two multiqueue devices, and have the bonding
> > driver configured to send all traffic to 192.168.1.1 over the first slave (which
> > we can accomplish using an appropriate tc rule on the bond device, setting
> > frames with that dest ip to have a skb->queue_mapping of 1).
> > 
> > In the above example, frames bound for 192.168.1.1 have skb->queue_mapping set
> > to 1 when they enter bond_queue_xmit.  bond_queue_xmit, calls
> > dev_queue_xmit(slave, skb), which calls dev_pick_tx(slave, skb).  If we (for
> > simplicity's sake) assume the slave has no ndo_select_queue method, dev_pick_tx
> > calls __skb_tx_hash to determine the output queue.  But the bonding driver
> > already set skb->queue_mapping to 1 (because it wanted this frame output on the
> > first slave, not because it wanted to transmit the frame on the slaves tx queue
> > 1).  Nevertheless, __skb_tx_hash() sees that skb_rx_queue_recorded returns true
> > here and as a result, calls skb_rx_get_queue, which returns 0.  Because of that
> > we wind up transmitting on hardware queue 0 all the time, which is not at all
> > what we want.  What we want is for the bonding driver to be able to use
> > queue_mapping for its own purposes, and then allow the physical device to make
> > its own output queue selection independently.
> [...]
> 
> I think I understand - the bonding device is effectively single-queue
> and shouldn't record an RX queue number.  But I think you should define
Well, no, its a multiqueue device, it just happens to treat slaves as queues,
rather than internal hardware objects.  But I suppose you're saying the same
thing that dave is, in that we should restore the origional queue_mapping,
rather than clearing it to preserve the input semantics when we reach the
physical device.  I can do that.

Neil
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 17b4dd9..812c70c 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -4200,6 +4200,11 @@  static inline int bond_slave_override(struct bonding *bond,
 	/* If the slave isn't UP, use default transmit policy. */
 	if (slave && slave->queue_id && IS_UP(slave->dev) &&
 	    (slave->link == BOND_LINK_UP)) {
+		/*
+ 		 * Reset the queue mapping to allow the underlying slave	
+ 		 * the chance to re-select an appropriate tx queue
+ 		 */
+		skb_set_queue_mapping(skb, 0);
 		res = bond_dev_queue_xmit(bond, skb, slave->dev);
 	}