Message ID | 1307037799-32315-1-git-send-email-nhorman@tuxdriver.com |
---|---|
State | Changes Requested, archived |
Delegated to: | David Miller |
Headers | show |
On Thu, 2011-06-02 at 14:03 -0400, Neil Horman wrote: > The bonding driver is multiqueue enabled, in which each queue represents a slave > to enable optional steering of output frames to given slaves against the default > output policy. However, it needs to reset the skb->queue_mapping prior to > queuing to the physical device or the physical slave (if it is multiqueue) could > wind up transmitting on an unintended tx queue (one that was reserved for > specific traffic classes for instance) > > Signed-off-by: Neil Horman <nhorman@tuxdriver.com> > CC: Jay Vosburgh <fubar@us.ibm.com> > CC: Andy Gospodarek <andy@greyhouse.net> > CC: "David S. Miller" <davem@davemloft.net> > --- > drivers/net/bonding/bond_main.c | 5 +++++ > 1 files changed, 5 insertions(+), 0 deletions(-) > > diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c > index 17b4dd9..812c70c 100644 > --- a/drivers/net/bonding/bond_main.c > +++ b/drivers/net/bonding/bond_main.c > @@ -4200,6 +4200,11 @@ static inline int bond_slave_override(struct bonding *bond, > /* If the slave isn't UP, use default transmit policy. */ > if (slave && slave->queue_id && IS_UP(slave->dev) && > (slave->link == BOND_LINK_UP)) { > + /* > + * Reset the queue mapping to allow the underlying slave > + * the chance to re-select an appropriate tx queue > + */ > + skb_set_queue_mapping(skb, 0); > res = bond_dev_queue_xmit(bond, skb, slave->dev); > } > So far as I can see, this has no effect, because dev_queue_xmit() always sets queue_mapping (in dev_pick_tx()). What is the problem you're seeing? Ben.
On Thu, Jun 02, 2011 at 07:35:53PM +0100, Ben Hutchings wrote: > On Thu, 2011-06-02 at 14:03 -0400, Neil Horman wrote: > > The bonding driver is multiqueue enabled, in which each queue represents a slave > > to enable optional steering of output frames to given slaves against the default > > output policy. However, it needs to reset the skb->queue_mapping prior to > > queuing to the physical device or the physical slave (if it is multiqueue) could > > wind up transmitting on an unintended tx queue (one that was reserved for > > specific traffic classes for instance) > > > > Signed-off-by: Neil Horman <nhorman@tuxdriver.com> > > CC: Jay Vosburgh <fubar@us.ibm.com> > > CC: Andy Gospodarek <andy@greyhouse.net> > > CC: "David S. Miller" <davem@davemloft.net> > > --- > > drivers/net/bonding/bond_main.c | 5 +++++ > > 1 files changed, 5 insertions(+), 0 deletions(-) > > > > diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c > > index 17b4dd9..812c70c 100644 > > --- a/drivers/net/bonding/bond_main.c > > +++ b/drivers/net/bonding/bond_main.c > > @@ -4200,6 +4200,11 @@ static inline int bond_slave_override(struct bonding *bond, > > /* If the slave isn't UP, use default transmit policy. */ > > if (slave && slave->queue_id && IS_UP(slave->dev) && > > (slave->link == BOND_LINK_UP)) { > > + /* > > + * Reset the queue mapping to allow the underlying slave > > + * the chance to re-select an appropriate tx queue > > + */ > > + skb_set_queue_mapping(skb, 0); > > res = bond_dev_queue_xmit(bond, skb, slave->dev); > > } > > > > So far as I can see, this has no effect, because dev_queue_xmit() always > sets queue_mapping (in dev_pick_tx()). > it resets the queue mapping exactly as you would expect it to. bonding is a multiqueue enabled device and selects a potentially non-zero queue based on the output of bond_select_queue. > What is the problem you're seeing? > The problem is exctly that. dev_pick_tx() on the bond device sets the queue_mapping as per the result of bond_select_queue (the ndo_select_queue method for the bonding driver). The implementation there is based on the use of tc with bonding, so that output slaves can be selected for certain types of traffic. But when that mechanism is used, skb->queue_mapping is preserved when the bonding driver queues the frame to the underlying slave. This denies the slave (if its also a multiqueue device) the opportunity to reselect the queue properly, because of this call path: bond_queue_xmit dev_queue_xmit(slave_dev) dev_pick_tx() skb_tx_hash() __skb_tx_hash() __skb_tx_hash sees that skb_queue_recorded returns true, and assigns a hardware queue mapping based on what the bonding driver chose using its own internal logic. Since bonding uses the multiqueue infrastructure to do slave output selection without any regard for slave output queue selection, it seems to me we should really reset the queue mapping to zero so the slave device can pick its own tx queue. Neil > Ben. > > -- > Ben Hutchings, Senior Software Engineer, Solarflare > Not speaking for my employer; that's the marketing department's job. > They asked us to note that Solarflare product names are trademarked. > > -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, 2011-06-02 at 14:56 -0400, Neil Horman wrote: > On Thu, Jun 02, 2011 at 07:35:53PM +0100, Ben Hutchings wrote: > > On Thu, 2011-06-02 at 14:03 -0400, Neil Horman wrote: > > > The bonding driver is multiqueue enabled, in which each queue represents a slave > > > to enable optional steering of output frames to given slaves against the default > > > output policy. However, it needs to reset the skb->queue_mapping prior to > > > queuing to the physical device or the physical slave (if it is multiqueue) could > > > wind up transmitting on an unintended tx queue (one that was reserved for > > > specific traffic classes for instance) [...] > > So far as I can see, this has no effect, because dev_queue_xmit() always > > sets queue_mapping (in dev_pick_tx()). > > > it resets the queue mapping exactly as you would expect it to. bonding is a > multiqueue enabled device and selects a potentially non-zero queue based on the > output of bond_select_queue. > > > What is the problem you're seeing? > > > The problem is exctly that. dev_pick_tx() on the bond device sets the > queue_mapping as per the result of bond_select_queue (the ndo_select_queue > method for the bonding driver). The implementation there is based on the use of > tc with bonding, so that output slaves can be selected for certain types of > traffic. But when that mechanism is used, skb->queue_mapping is preserved when > the bonding driver queues the frame to the underlying slave. This denies the > slave (if its also a multiqueue device) the opportunity to reselect the queue > properly, because of this call path: > > bond_queue_xmit > dev_queue_xmit(slave_dev) > dev_pick_tx() > skb_tx_hash() > __skb_tx_hash() > > __skb_tx_hash sees that skb_queue_recorded returns true, and assigns a hardware queue mapping > based on what the bonding driver chose using its own internal logic. Since > bonding uses the multiqueue infrastructure to do slave output selection without > any regard for slave output queue selection, it seems to me we should really > reset the queue mapping to zero so the slave device can pick its own tx queue. So you're effectively clearing the *RX queue* number (as this is before dev_pick_tx()) in order to influence TX queue selection. Here, the bonding device seems to be behaving as a forwarding device. If TX queue selection can go wrong for certain combinations of queue configuration when forwarding, then this is a problem for IP forwarding and bridging as well, isn't it? Ben.
On Thu, Jun 02, 2011 at 08:09:49PM +0100, Ben Hutchings wrote: > On Thu, 2011-06-02 at 14:56 -0400, Neil Horman wrote: > > On Thu, Jun 02, 2011 at 07:35:53PM +0100, Ben Hutchings wrote: > > > On Thu, 2011-06-02 at 14:03 -0400, Neil Horman wrote: > > > > The bonding driver is multiqueue enabled, in which each queue represents a slave > > > > to enable optional steering of output frames to given slaves against the default > > > > output policy. However, it needs to reset the skb->queue_mapping prior to > > > > queuing to the physical device or the physical slave (if it is multiqueue) could > > > > wind up transmitting on an unintended tx queue (one that was reserved for > > > > specific traffic classes for instance) > [...] > > > So far as I can see, this has no effect, because dev_queue_xmit() always > > > sets queue_mapping (in dev_pick_tx()). > > > > > it resets the queue mapping exactly as you would expect it to. bonding is a > > multiqueue enabled device and selects a potentially non-zero queue based on the > > output of bond_select_queue. > > > > > What is the problem you're seeing? > > > > > The problem is exctly that. dev_pick_tx() on the bond device sets the > > queue_mapping as per the result of bond_select_queue (the ndo_select_queue > > method for the bonding driver). The implementation there is based on the use of > > tc with bonding, so that output slaves can be selected for certain types of > > traffic. But when that mechanism is used, skb->queue_mapping is preserved when > > the bonding driver queues the frame to the underlying slave. This denies the > > slave (if its also a multiqueue device) the opportunity to reselect the queue > > properly, because of this call path: > > > > bond_queue_xmit > > dev_queue_xmit(slave_dev) > > dev_pick_tx() > > skb_tx_hash() > > __skb_tx_hash() > > > > __skb_tx_hash sees that skb_queue_recorded returns true, and assigns a hardware queue mapping > > based on what the bonding driver chose using its own internal logic. Since > > bonding uses the multiqueue infrastructure to do slave output selection without > > any regard for slave output queue selection, it seems to me we should really > > reset the queue mapping to zero so the slave device can pick its own tx queue. > > So you're effectively clearing the *RX queue* number (as this is before > dev_pick_tx()) in order to influence TX queue selection. > No, you're not seeing it properly. In bonding (as with all stacked devices) we make two passes through dev_pick_tx. 1) The first time we call dev_pick_tx is when the network stack calls dev_queue_xmit on the bond device. here we call ndo_select_queue, which calls bond_select_queue. This method tells the stack which queue to enqueue the skb on for the bond device. We can use tc's skbedit action to select a particular queue on the bond device for various classes of traffic, and those queues correspond to individual slave devices. 2) the second time we call dev_pick_tx is when the bonding bond_queue_xmit routine (which is the bonding drivers ndo_start_xmit method, called after the frist dev_pick_tx), calls dev_queue_xmit, passing the slave net_device pointer). In this case we do whatever the slave device has configured (either reset the queue to zero, or call the slaves ndo_select queue method). What I'm fixing is the fact that the bonding drivers queue_mappnig value is 'leaking' down to the slave. Lets say, for example we're bonding two multiqueue devices, and have the bonding driver configured to send all traffic to 192.168.1.1 over the first slave (which we can accomplish using an appropriate tc rule on the bond device, setting frames with that dest ip to have a skb->queue_mapping of 1). In the above example, frames bound for 192.168.1.1 have skb->queue_mapping set to 1 when they enter bond_queue_xmit. bond_queue_xmit, calls dev_queue_xmit(slave, skb), which calls dev_pick_tx(slave, skb). If we (for simplicity's sake) assume the slave has no ndo_select_queue method, dev_pick_tx calls __skb_tx_hash to determine the output queue. But the bonding driver already set skb->queue_mapping to 1 (because it wanted this frame output on the first slave, not because it wanted to transmit the frame on the slaves tx queue 1). Nevertheless, __skb_tx_hash() sees that skb_rx_queue_recorded returns true here and as a result, calls skb_rx_get_queue, which returns 0. Because of that we wind up transmitting on hardware queue 0 all the time, which is not at all what we want. What we want is for the bonding driver to be able to use queue_mapping for its own purposes, and then allow the physical device to make its own output queue selection independently. To do this, queue_mapping needs to be zero, prior to queuenig the skb to the slave, which is exactly what this patch does. > Here, the bonding device seems to be behaving as a forwarding device. > If TX queue selection can go wrong for certain combinations of queue > configuration when forwarding, then this is a problem for IP forwarding > and bridging as well, isn't it? > Potentially, yes. I only fixed this because I was looking at bonding and its queue_mapping behavior, and saw that this needed fixing. Bridging and IP forwarding should also likely clear the queue mapping in the forwarding path somewhere to avoid selecting an output tx queue that is a function of whatever queue and device it arrived on during ingress. I've not yet looked to see if thats already being done. Neil > Ben. > > -- > Ben Hutchings, Senior Software Engineer, Solarflare > Not speaking for my employer; that's the marketing department's job. > They asked us to note that Solarflare product names are trademarked. > > -- > To unsubscribe from this list: send the line "unsubscribe netdev" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Le 02/06/2011 21:46, Neil Horman a écrit : > On Thu, Jun 02, 2011 at 08:09:49PM +0100, Ben Hutchings wrote: >> On Thu, 2011-06-02 at 14:56 -0400, Neil Horman wrote: >>> On Thu, Jun 02, 2011 at 07:35:53PM +0100, Ben Hutchings wrote: >>>> On Thu, 2011-06-02 at 14:03 -0400, Neil Horman wrote: >>>>> The bonding driver is multiqueue enabled, in which each queue represents a slave >>>>> to enable optional steering of output frames to given slaves against the default >>>>> output policy. However, it needs to reset the skb->queue_mapping prior to >>>>> queuing to the physical device or the physical slave (if it is multiqueue) could >>>>> wind up transmitting on an unintended tx queue (one that was reserved for >>>>> specific traffic classes for instance) >> [...] >>>> So far as I can see, this has no effect, because dev_queue_xmit() always >>>> sets queue_mapping (in dev_pick_tx()). >>>> >>> it resets the queue mapping exactly as you would expect it to. bonding is a >>> multiqueue enabled device and selects a potentially non-zero queue based on the >>> output of bond_select_queue. >>> >>>> What is the problem you're seeing? >>>> >>> The problem is exctly that. dev_pick_tx() on the bond device sets the >>> queue_mapping as per the result of bond_select_queue (the ndo_select_queue >>> method for the bonding driver). The implementation there is based on the use of >>> tc with bonding, so that output slaves can be selected for certain types of >>> traffic. But when that mechanism is used, skb->queue_mapping is preserved when >>> the bonding driver queues the frame to the underlying slave. This denies the >>> slave (if its also a multiqueue device) the opportunity to reselect the queue >>> properly, because of this call path: >>> >>> bond_queue_xmit >>> dev_queue_xmit(slave_dev) >>> dev_pick_tx() >>> skb_tx_hash() >>> __skb_tx_hash() >>> >>> __skb_tx_hash sees that skb_queue_recorded returns true, and assigns a hardware queue mapping >>> based on what the bonding driver chose using its own internal logic. Since >>> bonding uses the multiqueue infrastructure to do slave output selection without >>> any regard for slave output queue selection, it seems to me we should really >>> reset the queue mapping to zero so the slave device can pick its own tx queue. >> >> So you're effectively clearing the *RX queue* number (as this is before >> dev_pick_tx()) in order to influence TX queue selection. >> > No, you're not seeing it properly. In bonding (as with all stacked devices) we > make two passes through dev_pick_tx. > > 1) The first time we call dev_pick_tx is when the network stack calls > dev_queue_xmit on the bond device. here we call ndo_select_queue, which calls > bond_select_queue. This method tells the stack which queue to enqueue the skb > on for the bond device. We can use tc's skbedit action to select a particular > queue on the bond device for various classes of traffic, and those queues > correspond to individual slave devices. > > 2) the second time we call dev_pick_tx is when the bonding bond_queue_xmit > routine (which is the bonding drivers ndo_start_xmit method, called after the > frist dev_pick_tx), calls dev_queue_xmit, passing the slave net_device pointer). > In this case we do whatever the slave device has configured (either reset the > queue to zero, or call the slaves ndo_select queue method). > > What I'm fixing is the fact that the bonding drivers queue_mappnig value is > 'leaking' down to the slave. > > Lets say, for example we're bonding two multiqueue devices, and have the bonding > driver configured to send all traffic to 192.168.1.1 over the first slave (which > we can accomplish using an appropriate tc rule on the bond device, setting > frames with that dest ip to have a skb->queue_mapping of 1). > > In the above example, frames bound for 192.168.1.1 have skb->queue_mapping set > to 1 when they enter bond_queue_xmit. bond_queue_xmit, calls > dev_queue_xmit(slave, skb), which calls dev_pick_tx(slave, skb). If we (for > simplicity's sake) assume the slave has no ndo_select_queue method, dev_pick_tx > calls __skb_tx_hash to determine the output queue. But the bonding driver > already set skb->queue_mapping to 1 (because it wanted this frame output on the > first slave, not because it wanted to transmit the frame on the slaves tx queue > 1). Nevertheless, __skb_tx_hash() sees that skb_rx_queue_recorded returns true > here and as a result, calls skb_rx_get_queue, which returns 0. Because of that > we wind up transmitting on hardware queue 0 all the time, which is not at all > what we want. What we want is for the bonding driver to be able to use > queue_mapping for its own purposes, and then allow the physical device to make > its own output queue selection independently. To do this, queue_mapping needs > to be zero, prior to queuenig the skb to the slave, which is exactly what this > patch does. This sounds good to me. Reviewed-by: Nicolas de Pesloüan <nicolas.2p.debian@free.fr> > >> Here, the bonding device seems to be behaving as a forwarding device. >> If TX queue selection can go wrong for certain combinations of queue >> configuration when forwarding, then this is a problem for IP forwarding >> and bridging as well, isn't it? >> > Potentially, yes. I only fixed this because I was looking at bonding and its > queue_mapping behavior, and saw that this needed fixing. Bridging and IP > forwarding should also likely clear the queue mapping in the forwarding path > somewhere to avoid selecting an output tx queue that is a function of whatever > queue and device it arrived on during ingress. I've not yet looked to see if > thats already being done. > > Neil -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
From: Ben Hutchings <bhutchings@solarflare.com> Date: Thu, 02 Jun 2011 19:35:53 +0100 > So far as I can see, this has no effect, because dev_queue_xmit() always > sets queue_mapping (in dev_pick_tx()). __skb_tx_hash() uses any hash value recorded in the SKB before trying to manually it. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
From: Neil Horman <nhorman@tuxdriver.com> Date: Thu, 2 Jun 2011 15:46:21 -0400 > Potentially, yes. I only fixed this because I was looking at bonding and its > queue_mapping behavior, and saw that this needed fixing. Bridging and IP > forwarding should also likely clear the queue mapping in the forwarding path > somewhere to avoid selecting an output tx queue that is a function of whatever > queue and device it arrived on during ingress. I've not yet looked to see if > thats already being done. No we do not do this, intentionally. That way the input classification and queue selection is mirrored on the transmit side, which we absolutely want to happen. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
From: Neil Horman <nhorman@tuxdriver.com> Date: Thu, 2 Jun 2011 14:03:19 -0400 > The bonding driver is multiqueue enabled, in which each queue represents a slave > to enable optional steering of output frames to given slaves against the default > output policy. However, it needs to reset the skb->queue_mapping prior to > queuing to the physical device or the physical slave (if it is multiqueue) could > wind up transmitting on an unintended tx queue (one that was reserved for > specific traffic classes for instance) > > Signed-off-by: Neil Horman <nhorman@tuxdriver.com> Since, as I mentioned, the idea when we are forwarding and bridging is that we use the input receive classification to influence the spread on transmit, I think things like this bonding case should remember the rxhash setting before they override it and then restore that value right before invoking dev_queue_xmit(). -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, 2011-06-02 at 15:46 -0400, Neil Horman wrote: > On Thu, Jun 02, 2011 at 08:09:49PM +0100, Ben Hutchings wrote: > > On Thu, 2011-06-02 at 14:56 -0400, Neil Horman wrote: > > > On Thu, Jun 02, 2011 at 07:35:53PM +0100, Ben Hutchings wrote: > > > > On Thu, 2011-06-02 at 14:03 -0400, Neil Horman wrote: > > > > > The bonding driver is multiqueue enabled, in which each queue represents a slave > > > > > to enable optional steering of output frames to given slaves against the default > > > > > output policy. However, it needs to reset the skb->queue_mapping prior to > > > > > queuing to the physical device or the physical slave (if it is multiqueue) could > > > > > wind up transmitting on an unintended tx queue (one that was reserved for > > > > > specific traffic classes for instance) > > [...] > > > > So far as I can see, this has no effect, because dev_queue_xmit() always > > > > sets queue_mapping (in dev_pick_tx()). > > > > > > > it resets the queue mapping exactly as you would expect it to. bonding is a > > > multiqueue enabled device and selects a potentially non-zero queue based on the > > > output of bond_select_queue. > > > > > > > What is the problem you're seeing? > > > > > > > The problem is exctly that. dev_pick_tx() on the bond device sets the > > > queue_mapping as per the result of bond_select_queue (the ndo_select_queue > > > method for the bonding driver). The implementation there is based on the use of > > > tc with bonding, so that output slaves can be selected for certain types of > > > traffic. But when that mechanism is used, skb->queue_mapping is preserved when > > > the bonding driver queues the frame to the underlying slave. This denies the > > > slave (if its also a multiqueue device) the opportunity to reselect the queue > > > properly, because of this call path: > > > > > > bond_queue_xmit > > > dev_queue_xmit(slave_dev) > > > dev_pick_tx() > > > skb_tx_hash() > > > __skb_tx_hash() > > > > > > __skb_tx_hash sees that skb_queue_recorded returns true, and assigns a hardware queue mapping > > > based on what the bonding driver chose using its own internal logic. Since > > > bonding uses the multiqueue infrastructure to do slave output selection without > > > any regard for slave output queue selection, it seems to me we should really > > > reset the queue mapping to zero so the slave device can pick its own tx queue. > > > > So you're effectively clearing the *RX queue* number (as this is before > > dev_pick_tx()) in order to influence TX queue selection. > > > No, you're not seeing it properly. In bonding (as with all stacked devices) we > make two passes through dev_pick_tx. Only if the stacked device has its own software queues and schedulers; e.g. VLAN devices don't. > 1) The first time we call dev_pick_tx is when the network stack calls > dev_queue_xmit on the bond device. here we call ndo_select_queue, which calls > bond_select_queue. This method tells the stack which queue to enqueue the skb > on for the bond device. We can use tc's skbedit action to select a particular > queue on the bond device for various classes of traffic, and those queues > correspond to individual slave devices. > > 2) the second time we call dev_pick_tx is when the bonding bond_queue_xmit > routine (which is the bonding drivers ndo_start_xmit method, called after the > frist dev_pick_tx), calls dev_queue_xmit, passing the slave net_device pointer). > In this case we do whatever the slave device has configured (either reset the > queue to zero, or call the slaves ndo_select queue method). > > What I'm fixing is the fact that the bonding drivers queue_mappnig value is > 'leaking' down to the slave. And this is because dev_queue_xmit() assumes that it's a record of the RX queue number. The differing interpretation of queue_mapping for RX and TX is annoying and I think we should change the initial value of queue_mapping to -1. But that's a separate issue. > Lets say, for example we're bonding two multiqueue devices, and have the bonding > driver configured to send all traffic to 192.168.1.1 over the first slave (which > we can accomplish using an appropriate tc rule on the bond device, setting > frames with that dest ip to have a skb->queue_mapping of 1). > > In the above example, frames bound for 192.168.1.1 have skb->queue_mapping set > to 1 when they enter bond_queue_xmit. bond_queue_xmit, calls > dev_queue_xmit(slave, skb), which calls dev_pick_tx(slave, skb). If we (for > simplicity's sake) assume the slave has no ndo_select_queue method, dev_pick_tx > calls __skb_tx_hash to determine the output queue. But the bonding driver > already set skb->queue_mapping to 1 (because it wanted this frame output on the > first slave, not because it wanted to transmit the frame on the slaves tx queue > 1). Nevertheless, __skb_tx_hash() sees that skb_rx_queue_recorded returns true > here and as a result, calls skb_rx_get_queue, which returns 0. Because of that > we wind up transmitting on hardware queue 0 all the time, which is not at all > what we want. What we want is for the bonding driver to be able to use > queue_mapping for its own purposes, and then allow the physical device to make > its own output queue selection independently. [...] I think I understand - the bonding device is effectively single-queue and shouldn't record an RX queue number. But I think you should define and use skb_clear_rx_queue() to set queue_mapping=0, rather than abusing skb_set_queue_mapping() which is meant to take a TX queue number. Ben.
Le 02/06/2011 22:04, David Miller a écrit : > From: Neil Horman<nhorman@tuxdriver.com> > Date: Thu, 2 Jun 2011 15:46:21 -0400 > >> Potentially, yes. I only fixed this because I was looking at bonding and its >> queue_mapping behavior, and saw that this needed fixing. Bridging and IP >> forwarding should also likely clear the queue mapping in the forwarding path >> somewhere to avoid selecting an output tx queue that is a function of whatever >> queue and device it arrived on during ingress. I've not yet looked to see if >> thats already being done. > > No we do not do this, intentionally. > > That way the input classification and queue selection is mirrored > on the transmit side, which we absolutely want to happen. Can you confirm that this is the expected behavior for IP forwarding and bridge but not for bonding? To be more precise, due to the way bonding use queue mapping for slave selection, it is desirable to clear the mapping before sending to the slave, because the meaning of the mapping for the slave interface might be really different from the meaning for the bonding interface. Arguably, this is the mapping usage in bonding which is "different" from other usages, but... Nicolas. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Le 02/06/2011 22:07, David Miller a écrit : > From: Neil Horman<nhorman@tuxdriver.com> > Date: Thu, 2 Jun 2011 14:03:19 -0400 > >> The bonding driver is multiqueue enabled, in which each queue represents a slave >> to enable optional steering of output frames to given slaves against the default >> output policy. However, it needs to reset the skb->queue_mapping prior to >> queuing to the physical device or the physical slave (if it is multiqueue) could >> wind up transmitting on an unintended tx queue (one that was reserved for >> specific traffic classes for instance) >> >> Signed-off-by: Neil Horman<nhorman@tuxdriver.com> > > Since, as I mentioned, the idea when we are forwarding and bridging is that > we use the input receive classification to influence the spread on transmit, > I think things like this bonding case should remember the rxhash setting > before they override it and then restore that value right before invoking > dev_queue_xmit(). Ok, now I understand. Maybe, using queue mapping for special slave selection wasn't such a good idea at the very beginning, because it pollutes the RX mapping that is expected to be propagated up to TX. Restoring the original value before invoking dev_queue_xmit() would fix this, but I'm not sure it is the cleanest way to do it. Nicolas. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
From: Nicolas de Pesloüan <nicolas.2p.debian@gmail.com> Date: Thu, 02 Jun 2011 22:13:57 +0200 > To be more precise, due to the way bonding use queue mapping for slave > selection, it is desirable to clear the mapping before sending to the > slave, because the meaning of the mapping for the slave interface > might be really different from the meaning for the bonding > interface. Arguably, this is the mapping usage in bonding which is > "different" from other usages, but... This just confirms my reasoning behind why I wanted to discourage drivers from providing explicit ->ndo_select_queue() methods unless absolutely necessary. Information now gets lost in cases like this bonding issue. Bonding should definitely, as I suggested, remember the original rxhash value and restore it when sending to the slave. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, 2011-06-02 at 13:46 -0700, David Miller wrote: > From: Nicolas de Pesloüan <nicolas.2p.debian@gmail.com> > Date: Thu, 02 Jun 2011 22:13:57 +0200 > > > To be more precise, due to the way bonding use queue mapping for slave > > selection, it is desirable to clear the mapping before sending to the > > slave, because the meaning of the mapping for the slave interface > > might be really different from the meaning for the bonding > > interface. Arguably, this is the mapping usage in bonding which is > > "different" from other usages, but... > > This just confirms my reasoning behind why I wanted to discourage > drivers from providing explicit ->ndo_select_queue() methods unless > absolutely necessary. > > Information now gets lost in cases like this bonding issue. > > Bonding should definitely, as I suggested, remember the original > rxhash value and restore it when sending to the slave. Surely RX queue (queue_mapping), not RX hash (rxhash, which is unchanged anyway AFAIK). Ben.
From: Ben Hutchings <bhutchings@solarflare.com> Date: Thu, 02 Jun 2011 21:51:46 +0100 > On Thu, 2011-06-02 at 13:46 -0700, David Miller wrote: >> From: Nicolas de Pesloüan <nicolas.2p.debian@gmail.com> >> Date: Thu, 02 Jun 2011 22:13:57 +0200 >> >> > To be more precise, due to the way bonding use queue mapping for slave >> > selection, it is desirable to clear the mapping before sending to the >> > slave, because the meaning of the mapping for the slave interface >> > might be really different from the meaning for the bonding >> > interface. Arguably, this is the mapping usage in bonding which is >> > "different" from other usages, but... >> >> This just confirms my reasoning behind why I wanted to discourage >> drivers from providing explicit ->ndo_select_queue() methods unless >> absolutely necessary. >> >> Information now gets lost in cases like this bonding issue. >> >> Bonding should definitely, as I suggested, remember the original >> rxhash value and restore it when sending to the slave. > > Surely RX queue (queue_mapping), not RX hash (rxhash, which is unchanged > anyway AFAIK). Right. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Jun 02, 2011 at 01:07:10PM -0700, David Miller wrote: > From: Neil Horman <nhorman@tuxdriver.com> > Date: Thu, 2 Jun 2011 14:03:19 -0400 > > > The bonding driver is multiqueue enabled, in which each queue represents a slave > > to enable optional steering of output frames to given slaves against the default > > output policy. However, it needs to reset the skb->queue_mapping prior to > > queuing to the physical device or the physical slave (if it is multiqueue) could > > wind up transmitting on an unintended tx queue (one that was reserved for > > specific traffic classes for instance) > > > > Signed-off-by: Neil Horman <nhorman@tuxdriver.com> > > Since, as I mentioned, the idea when we are forwarding and bridging is that > we use the input receive classification to influence the spread on transmit, > I think things like this bonding case should remember the rxhash setting > before they override it and then restore that value right before invoking > dev_queue_xmit(). > Ok, I can respin the patch to do that. I'll handle it in the AM. Thanks Neil -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Jun 02, 2011 at 09:13:24PM +0100, Ben Hutchings wrote: > On Thu, 2011-06-02 at 15:46 -0400, Neil Horman wrote: > > On Thu, Jun 02, 2011 at 08:09:49PM +0100, Ben Hutchings wrote: > > > On Thu, 2011-06-02 at 14:56 -0400, Neil Horman wrote: > > > > On Thu, Jun 02, 2011 at 07:35:53PM +0100, Ben Hutchings wrote: > > > > > On Thu, 2011-06-02 at 14:03 -0400, Neil Horman wrote: > > > > > > The bonding driver is multiqueue enabled, in which each queue represents a slave > > > > > > to enable optional steering of output frames to given slaves against the default > > > > > > output policy. However, it needs to reset the skb->queue_mapping prior to > > > > > > queuing to the physical device or the physical slave (if it is multiqueue) could > > > > > > wind up transmitting on an unintended tx queue (one that was reserved for > > > > > > specific traffic classes for instance) > > > [...] > > > > > So far as I can see, this has no effect, because dev_queue_xmit() always > > > > > sets queue_mapping (in dev_pick_tx()). > > > > > > > > > it resets the queue mapping exactly as you would expect it to. bonding is a > > > > multiqueue enabled device and selects a potentially non-zero queue based on the > > > > output of bond_select_queue. > > > > > > > > > What is the problem you're seeing? > > > > > > > > > The problem is exctly that. dev_pick_tx() on the bond device sets the > > > > queue_mapping as per the result of bond_select_queue (the ndo_select_queue > > > > method for the bonding driver). The implementation there is based on the use of > > > > tc with bonding, so that output slaves can be selected for certain types of > > > > traffic. But when that mechanism is used, skb->queue_mapping is preserved when > > > > the bonding driver queues the frame to the underlying slave. This denies the > > > > slave (if its also a multiqueue device) the opportunity to reselect the queue > > > > properly, because of this call path: > > > > > > > > bond_queue_xmit > > > > dev_queue_xmit(slave_dev) > > > > dev_pick_tx() > > > > skb_tx_hash() > > > > __skb_tx_hash() > > > > > > > > __skb_tx_hash sees that skb_queue_recorded returns true, and assigns a hardware queue mapping > > > > based on what the bonding driver chose using its own internal logic. Since > > > > bonding uses the multiqueue infrastructure to do slave output selection without > > > > any regard for slave output queue selection, it seems to me we should really > > > > reset the queue mapping to zero so the slave device can pick its own tx queue. > > > > > > So you're effectively clearing the *RX queue* number (as this is before > > > dev_pick_tx()) in order to influence TX queue selection. > > > > > No, you're not seeing it properly. In bonding (as with all stacked devices) we > > make two passes through dev_pick_tx. > > Only if the stacked device has its own software queues and schedulers; > e.g. VLAN devices don't. > Every net_device has a default fifo qdisc, and as such, each net_device when called with dev_queue_xmit calls dev_pick_tx, and that includes vlans. It just so happens that vlans are a single queue device and come out of dev_pick_tx with a queue_mapping of 0. Thats all moot of course, since we're talking about bonding. > > 1) The first time we call dev_pick_tx is when the network stack calls > > dev_queue_xmit on the bond device. here we call ndo_select_queue, which calls > > bond_select_queue. This method tells the stack which queue to enqueue the skb > > on for the bond device. We can use tc's skbedit action to select a particular > > queue on the bond device for various classes of traffic, and those queues > > correspond to individual slave devices. > > > > 2) the second time we call dev_pick_tx is when the bonding bond_queue_xmit > > routine (which is the bonding drivers ndo_start_xmit method, called after the > > frist dev_pick_tx), calls dev_queue_xmit, passing the slave net_device pointer). > > In this case we do whatever the slave device has configured (either reset the > > queue to zero, or call the slaves ndo_select queue method). > > > > What I'm fixing is the fact that the bonding drivers queue_mappnig value is > > 'leaking' down to the slave. > > And this is because dev_queue_xmit() assumes that it's a record of the > RX queue number. > Yes, correct. stacked devices provide the opportuninty for queue_mapping to no longer be that rx queue number, but rather a tx queue mapping. This is a potentential problem with all stacked devices, but bonding is the only thing showing it right now because its the only stacked device that can set a non-zero queue_mapping value. > The differing interpretation of queue_mapping for RX and TX is annoying > and I think we should change the initial value of queue_mapping to -1. > But that's a separate issue. > Agreed. > > Lets say, for example we're bonding two multiqueue devices, and have the bonding > > driver configured to send all traffic to 192.168.1.1 over the first slave (which > > we can accomplish using an appropriate tc rule on the bond device, setting > > frames with that dest ip to have a skb->queue_mapping of 1). > > > > In the above example, frames bound for 192.168.1.1 have skb->queue_mapping set > > to 1 when they enter bond_queue_xmit. bond_queue_xmit, calls > > dev_queue_xmit(slave, skb), which calls dev_pick_tx(slave, skb). If we (for > > simplicity's sake) assume the slave has no ndo_select_queue method, dev_pick_tx > > calls __skb_tx_hash to determine the output queue. But the bonding driver > > already set skb->queue_mapping to 1 (because it wanted this frame output on the > > first slave, not because it wanted to transmit the frame on the slaves tx queue > > 1). Nevertheless, __skb_tx_hash() sees that skb_rx_queue_recorded returns true > > here and as a result, calls skb_rx_get_queue, which returns 0. Because of that > > we wind up transmitting on hardware queue 0 all the time, which is not at all > > what we want. What we want is for the bonding driver to be able to use > > queue_mapping for its own purposes, and then allow the physical device to make > > its own output queue selection independently. > [...] > > I think I understand - the bonding device is effectively single-queue > and shouldn't record an RX queue number. But I think you should define Well, no, its a multiqueue device, it just happens to treat slaves as queues, rather than internal hardware objects. But I suppose you're saying the same thing that dave is, in that we should restore the origional queue_mapping, rather than clearing it to preserve the input semantics when we reach the physical device. I can do that. Neil -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c index 17b4dd9..812c70c 100644 --- a/drivers/net/bonding/bond_main.c +++ b/drivers/net/bonding/bond_main.c @@ -4200,6 +4200,11 @@ static inline int bond_slave_override(struct bonding *bond, /* If the slave isn't UP, use default transmit policy. */ if (slave && slave->queue_id && IS_UP(slave->dev) && (slave->link == BOND_LINK_UP)) { + /* + * Reset the queue mapping to allow the underlying slave + * the chance to re-select an appropriate tx queue + */ + skb_set_queue_mapping(skb, 0); res = bond_dev_queue_xmit(bond, skb, slave->dev); }
The bonding driver is multiqueue enabled, in which each queue represents a slave to enable optional steering of output frames to given slaves against the default output policy. However, it needs to reset the skb->queue_mapping prior to queuing to the physical device or the physical slave (if it is multiqueue) could wind up transmitting on an unintended tx queue (one that was reserved for specific traffic classes for instance) Signed-off-by: Neil Horman <nhorman@tuxdriver.com> CC: Jay Vosburgh <fubar@us.ibm.com> CC: Andy Gospodarek <andy@greyhouse.net> CC: "David S. Miller" <davem@davemloft.net> --- drivers/net/bonding/bond_main.c | 5 +++++ 1 files changed, 5 insertions(+), 0 deletions(-)