diff mbox

[RFC,net-next] net: core: Pass XPS select queue decision to skb_tx_hash

Message ID 1459290252-4121-1-git-send-email-saeedm@mellanox.com
State RFC, archived
Delegated to: David Miller
Headers show

Commit Message

Saeed Mahameed March 29, 2016, 10:24 p.m. UTC
Currently XPS select queue decision is final and overrides/ignores all other
select queue parameters such as QoS TC, RX recording.

This patch makes get_xps_queue value as a hint for skb_tx_hash, which will decide
whether to use this hint as is or to tweak it a little to provide the correct TXQ.

This will fix bugs in cases such that TC QoS offload (tc_to_txq mapping) and XPS mapping
are configured but select queue will only respect the XPS configuration which will skip
the TC QoS queue selection and thus will not satisfy the QoS configuration.

RFC because I want to discuss how we would like the final behavior of the
__netdev_pick_tx, with this patch it goes as follows:

netdev_pick_tx(skb):
        hint = get_xps_queue
        txq = skb_tx_hash(skb, hint)

skb_tx_hash(skb, hint):
        if (skb_rx_queue_recorded(skb))
                return skb_get_rx_queue(skb);

        queue_offset = 0;
        if (dev->num_tc)
                queue_offset = tc_queue_offset[tc];

        hash = hint < 0 ? skb_get_hash(skb) : hint;
        return hash + queue_offset;

i.e: instead of blindly return the XPS decision, we pass it to skb_tx_hash which can make the final
decision for us, select queue will now respect and combine XPS and TC QoS tc_to_txq mappings.

Also there is one additional behavioral change that recorded rx queues will
now override the XPS configuration.

Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx4/en_tx.c |    2 +-
 include/linux/netdevice.h                  |    6 +++---
 net/core/dev.c                             |   10 ++++++----
 3 files changed, 10 insertions(+), 8 deletions(-)

Comments

John Fastabend March 30, 2016, 12:18 a.m. UTC | #1
On 16-03-29 03:24 PM, Saeed Mahameed wrote:
> Currently XPS select queue decision is final and overrides/ignores all other
> select queue parameters such as QoS TC, RX recording.
> 
> This patch makes get_xps_queue value as a hint for skb_tx_hash, which will decide
> whether to use this hint as is or to tweak it a little to provide the correct TXQ.
> 
> This will fix bugs in cases such that TC QoS offload (tc_to_txq mapping) and XPS mapping
> are configured but select queue will only respect the XPS configuration which will skip
> the TC QoS queue selection and thus will not satisfy the QoS configuration.
> 
> RFC because I want to discuss how we would like the final behavior of the
> __netdev_pick_tx, with this patch it goes as follows:
> 
> netdev_pick_tx(skb):
>         hint = get_xps_queue
>         txq = skb_tx_hash(skb, hint)
> 
> skb_tx_hash(skb, hint):
>         if (skb_rx_queue_recorded(skb))
>                 return skb_get_rx_queue(skb);
> 
>         queue_offset = 0;
>         if (dev->num_tc)
>                 queue_offset = tc_queue_offset[tc];
> 
>         hash = hint < 0 ? skb_get_hash(skb) : hint;
>         return hash + queue_offset;
> 
> i.e: instead of blindly return the XPS decision, we pass it to skb_tx_hash which can make the final
> decision for us, select queue will now respect and combine XPS and TC QoS tc_to_txq mappings.
> 
> Also there is one additional behavioral change that recorded rx queues will
> now override the XPS configuration.
> 
> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
> ---
>  drivers/net/ethernet/mellanox/mlx4/en_tx.c |    2 +-
>  include/linux/netdevice.h                  |    6 +++---
>  net/core/dev.c                             |   10 ++++++----
>  3 files changed, 10 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
> index c0d7b72..873cf49 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
> @@ -693,7 +693,7 @@ u16 mlx4_en_select_queue(struct net_device *dev, struct sk_buff *skb,
>  	u8 up = 0;
>  
>  	if (dev->num_tc)
> -		return skb_tx_hash(dev, skb);
> +		return skb_tx_hash(dev, skb, -1);
>  

I would prefer to not have another strange quirk users have to remember
in order to do tx classification. So with this change depending on the
driver the queue selection precedence changes. In short I agree with
the problem statement but think we can find a better solution.

One idea that comes to mind is we can have a tc action to force the
queue selection? Now that we have the egress tc hook it would probably
be fairly cheap to implement and if users want this behavior they can
ask for it explicitly.

If your thinking about tc stuff we could fix the tooling to set this
action when ever dcb is turned on or hardware rate limiting is enabled,
etc. And even if we wanted we could have the driver add the rule in the
cases where firmware protocols are configuring the QOS/etc.


>  	if (skb_vlan_tag_present(skb))
>  		up = skb_vlan_tag_get(skb) >> VLAN_PRIO_SHIFT;
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index cb0d5d0..ad81ffe 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -3130,16 +3130,16 @@ static inline int netif_set_xps_queue(struct net_device *dev,
>  #endif
>  
>  u16 __skb_tx_hash(const struct net_device *dev, struct sk_buff *skb,
> -		  unsigned int num_tx_queues);
> +		  unsigned int num_tx_queues, int txq_hint);
>  

[...]

And all this seems like it would only ever be called by drivers select
queue routines which I really wish we could kill off one of these days
instead of add to. Now if the signal is something higher in the stack
and not the driver I think it is OK.

.John
Saeed Mahameed March 30, 2016, 1:23 p.m. UTC | #2
On 3/30/2016 3:18 AM, John Fastabend wrote:
> I would prefer to not have another strange quirk users have to 
> remember in order to do tx classification. So with this change 
> depending on the driver the queue selection precedence changes. 
This change doesn't depend on the driver it affects all drivers that 
implement the select queue ndo and use the default fallback 
"pick_tx_queue" which this patch came to fix, or any driver that doesn't 
implement the ndo (the fallback is the default in this case).
> In short I agree with the problem statement but think we can find a 
> better solution. One idea that comes to mind is we can have a tc 
> action to force the queue selection? Now that we have the egress tc 
> hook it would probably be fairly cheap to implement and if users want 
> this behavior they can ask for it explicitly. If your thinking about 
> tc stuff we could fix the tooling to set this action when ever dcb is 
> turned on or hardware rate limiting is enabled, etc. And even if we 
> wanted we could have the driver add the rule in the cases where 
> firmware protocols are configuring the QOS/etc. 
Why would you ask for a bug fix explicitly ? IMHO this how I expect the 
pick _tx_queue routine to behave, why would I disable XPS in order for 
select queue to choose according TC QoS ?
as this patch suggests we can benefit from both without any additional 
tooling !

>>   	if (skb_vlan_tag_present(skb))
>>   		up = skb_vlan_tag_get(skb) >> VLAN_PRIO_SHIFT;
>> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>> index cb0d5d0..ad81ffe 100644
>> --- a/include/linux/netdevice.h
>> +++ b/include/linux/netdevice.h
>> @@ -3130,16 +3130,16 @@ static inline int netif_set_xps_queue(struct net_device *dev,
>>   #endif
>>   
>>   u16 __skb_tx_hash(const struct net_device *dev, struct sk_buff *skb,
>> -		  unsigned int num_tx_queues);
>> +		  unsigned int num_tx_queues, int txq_hint);
>>   
> [...]
>
> And all this seems like it would only ever be called by drivers select
> queue routines which I really wish we could kill off one of these days
> instead of add to. Now if the signal is something higher in the stack
> and not the driver I think it is OK.
I agree, drivers shouldn't call this function, the only reason drivers 
call this function is to bypass get_xps_queue
and after this patch i don't think driver will need to call it, since it 
will be called even if XPS is configured.
John Fastabend March 30, 2016, 5:04 p.m. UTC | #3
On 16-03-30 06:23 AM, Saeed Mahameed wrote:
> 
> 
> On 3/30/2016 3:18 AM, John Fastabend wrote:
>> I would prefer to not have another strange quirk users have to
>> remember in order to do tx classification. So with this change
>> depending on the driver the queue selection precedence changes. 

> This change doesn't depend on the driver it affects all drivers that
> implement the select queue ndo and use the default fallback
> "pick_tx_queue" which this patch came to fix, or any driver that doesn't
> implement the ndo (the fallback is the default in this case).

Yep, sorry I read the patch to quickly and without coffee thanks!

>> In short I agree with the problem statement but think we can find a
>> better solution. One idea that comes to mind is we can have a tc
>> action to force the queue selection? Now that we have the egress tc
>> hook it would probably be fairly cheap to implement and if users want
>> this behavior they can ask for it explicitly. If your thinking about
>> tc stuff we could fix the tooling to set this action when ever dcb is
>> turned on or hardware rate limiting is enabled, etc. And even if we
>> wanted we could have the driver add the rule in the cases where
>> firmware protocols are configuring the QOS/etc. 

> Why would you ask for a bug fix explicitly ? IMHO this how I expect the
> pick _tx_queue routine to behave, why would I disable XPS in order for
> select queue to choose according TC QoS ?
> as this patch suggests we can benefit from both without any additional
> tooling !
> 

OK, so let me see if I get this right now. This was the precedence
before the patch in the normal no select queue case,

	(1) socket mapping sk_tx_queue_mapping iff !ooo_okay
	(2) xps
	(3) skb->queue_mapping
	(4) qoffset/qcount (hash over tc queues)
	(5) hash over num_tx_queues

With this patch the precedence is a bit changed because
skb_tx_hash is always called.

	(1) socket mapping sk_tx_queue_mapping iff !ooo_okay
	(2) skb->queue_mapping
	(3) qoffset/qcount
	   (hash over tc queues if xps choice is > qcount)
	(4) xps
	(5) hash over num_tx_queues

Sound right? Nice thing about this with correct configuration
of tc with qcount = xps_queues it sort of works as at least
I expect it to. I think the question is are people OK with
letting skb->queue_mapping take precedence. I am at least
because it makes the skb edit queue_mapping action from tc
easier to use.

And just a comment on the code why not just move get_xps_queue
into skb_tx_hash at this point if its always being called as the
"hint". Then we avoid calling it in the case queue_mapping is
set.

>>>       if (skb_vlan_tag_present(skb))
>>>           up = skb_vlan_tag_get(skb) >> VLAN_PRIO_SHIFT;
>>> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>>> index cb0d5d0..ad81ffe 100644
>>> --- a/include/linux/netdevice.h
>>> +++ b/include/linux/netdevice.h
>>> @@ -3130,16 +3130,16 @@ static inline int netif_set_xps_queue(struct
>>> net_device *dev,
>>>   #endif
>>>     u16 __skb_tx_hash(const struct net_device *dev, struct sk_buff *skb,
>>> -          unsigned int num_tx_queues);
>>> +          unsigned int num_tx_queues, int txq_hint);
>>>   
>> [...]
>>
>> And all this seems like it would only ever be called by drivers select
>> queue routines which I really wish we could kill off one of these days
>> instead of add to. Now if the signal is something higher in the stack
>> and not the driver I think it is OK.
> I agree, drivers shouldn't call this function, the only reason drivers
> call this function is to bypass get_xps_queue
> and after this patch i don't think driver will need to call it, since it
> will be called even if XPS is configured.
> 

yep.
Saeed Mahameed March 30, 2016, 6:30 p.m. UTC | #4
On Wed, Mar 30, 2016 at 8:04 PM, John Fastabend
<john.fastabend@gmail.com> wrote:
>
> OK, so let me see if I get this right now. This was the precedence
> before the patch in the normal no select queue case,
>
>         (1) socket mapping sk_tx_queue_mapping iff !ooo_okay
>         (2) xps
>         (3) skb->queue_mapping
>         (4) qoffset/qcount (hash over tc queues)
>         (5) hash over num_tx_queues
>
> With this patch the precedence is a bit changed because
> skb_tx_hash is always called.
>
>         (1) socket mapping sk_tx_queue_mapping iff !ooo_okay
>         (2) skb->queue_mapping
>         (3) qoffset/qcount
>            (hash over tc queues if xps choice is > qcount)
>         (4) xps
>         (5) hash over num_tx_queues
>
> Sound right? Nice thing about this with correct configuration
> of tc with qcount = xps_queues it sort of works as at least

Yes !
for qcount = xps_queues which almost all drivers default
configurations goes this way, it works like charm, xps selects the
exact TC TX queue at the correct offset without any need for further
SKB hashing.
and even if by mistake XPS was also configured on TC TX queue then
this patch will detect that the xps hash is out of this TC
offset/qcount range and will re-hash. But i don't see why would user
or driver do such strange configuration.

> I expect it to. I think the question is are people OK with
> letting skb->queue_mapping take precedence. I am at least
> because it makes the skb edit queue_mapping action from tc
> easier to use.
>

skb->queue_mapping toke precedence also before this patch, the only
thing this patch came to change is how to compute the txq when
skb->queue_mapping is not present, so we don't need to worry about
this.

> And just a comment on the code why not just move get_xps_queue
> into skb_tx_hash at this point if its always being called as the
> "hint". Then we avoid calling it in the case queue_mapping is
> set.
>

Very good point, the only place that calls skb_tx_hash(dev, skb) other
than __netdev_pick_tx is mlx4 driver and they did it there just
because they wanted to bypass XPS configuration if TC QoS is
configured, with this fix we don't have to bypass XPS at all for when
TC is configured.

I will change it.
John Fastabend April 1, 2016, 3:49 a.m. UTC | #5
On 16-03-30 11:30 AM, Saeed Mahameed wrote:
> On Wed, Mar 30, 2016 at 8:04 PM, John Fastabend
> <john.fastabend@gmail.com> wrote:
>>
>> OK, so let me see if I get this right now. This was the precedence
>> before the patch in the normal no select queue case,
>>
>>         (1) socket mapping sk_tx_queue_mapping iff !ooo_okay
>>         (2) xps
>>         (3) skb->queue_mapping
>>         (4) qoffset/qcount (hash over tc queues)
>>         (5) hash over num_tx_queues
>>
>> With this patch the precedence is a bit changed because
>> skb_tx_hash is always called.
>>
>>         (1) socket mapping sk_tx_queue_mapping iff !ooo_okay
>>         (2) skb->queue_mapping
>>         (3) qoffset/qcount
>>            (hash over tc queues if xps choice is > qcount)
>>         (4) xps
>>         (5) hash over num_tx_queues
>>
>> Sound right? Nice thing about this with correct configuration
>> of tc with qcount = xps_queues it sort of works as at least
> 
> Yes !
> for qcount = xps_queues which almost all drivers default
> configurations goes this way, it works like charm, xps selects the
> exact TC TX queue at the correct offset without any need for further
> SKB hashing.
> and even if by mistake XPS was also configured on TC TX queue then
> this patch will detect that the xps hash is out of this TC
> offset/qcount range and will re-hash. But i don't see why would user
> or driver do such strange configuration.
> 
>> I expect it to. I think the question is are people OK with
>> letting skb->queue_mapping take precedence. I am at least
>> because it makes the skb edit queue_mapping action from tc
>> easier to use.
>>
> 
> skb->queue_mapping toke precedence also before this patch, the only
> thing this patch came to change is how to compute the txq when
> skb->queue_mapping is not present, so we don't need to worry about
> this.
> 

I don't believe that is correct in the general case. Perhaps
in the ndo_select_queue path though. See this line,

        if (queue_index < 0 || skb->ooo_okay ||
            queue_index >= dev->real_num_tx_queues) {
                int new_index = get_xps_queue(dev, skb);
                if (new_index < 0)
                        new_index = skb_tx_hash(dev, skb);

The skb_tx_hash() routine is never called if xps is enabled.
And so we never get into the call to do this,

        if (skb_rx_queue_recorded(skb)) {
                hash = skb_get_rx_queue(skb);
                while (unlikely(hash >= num_tx_queues))
                        hash -= num_tx_queues;
                return hash;
        }

Right? FWIW I think that using queue_mapping before xps is better
because we can use tc to pick the queue_mapping them programmatically
if we want for these special cases instead if wanted.

>> And just a comment on the code why not just move get_xps_queue
>> into skb_tx_hash at this point if its always being called as the
>> "hint". Then we avoid calling it in the case queue_mapping is
>> set.
>>
> 
> Very good point, the only place that calls skb_tx_hash(dev, skb) other
> than __netdev_pick_tx is mlx4 driver and they did it there just
> because they wanted to bypass XPS configuration if TC QoS is
> configured, with this fix we don't have to bypass XPS at all for when
> TC is configured.
> 
> I will change it.
> 

Great thanks.
diff mbox

Patch

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
index c0d7b72..873cf49 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
@@ -693,7 +693,7 @@  u16 mlx4_en_select_queue(struct net_device *dev, struct sk_buff *skb,
 	u8 up = 0;
 
 	if (dev->num_tc)
-		return skb_tx_hash(dev, skb);
+		return skb_tx_hash(dev, skb, -1);
 
 	if (skb_vlan_tag_present(skb))
 		up = skb_vlan_tag_get(skb) >> VLAN_PRIO_SHIFT;
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index cb0d5d0..ad81ffe 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3130,16 +3130,16 @@  static inline int netif_set_xps_queue(struct net_device *dev,
 #endif
 
 u16 __skb_tx_hash(const struct net_device *dev, struct sk_buff *skb,
-		  unsigned int num_tx_queues);
+		  unsigned int num_tx_queues, int txq_hint);
 
 /*
  * Returns a Tx hash for the given packet when dev->real_num_tx_queues is used
  * as a distribution range limit for the returned value.
  */
 static inline u16 skb_tx_hash(const struct net_device *dev,
-			      struct sk_buff *skb)
+			      struct sk_buff *skb, int txq_hint)
 {
-	return __skb_tx_hash(dev, skb, dev->real_num_tx_queues);
+	return __skb_tx_hash(dev, skb, dev->real_num_tx_queues, txq_hint);
 }
 
 /**
diff --git a/net/core/dev.c b/net/core/dev.c
index b9bcbe7..ff640b7 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2394,7 +2394,7 @@  EXPORT_SYMBOL(netif_device_attach);
  * to be used as a distribution range.
  */
 u16 __skb_tx_hash(const struct net_device *dev, struct sk_buff *skb,
-		  unsigned int num_tx_queues)
+		  unsigned int num_tx_queues, int txq_hint)
 {
 	u32 hash;
 	u16 qoffset = 0;
@@ -2411,9 +2411,12 @@  u16 __skb_tx_hash(const struct net_device *dev, struct sk_buff *skb,
 		u8 tc = netdev_get_prio_tc_map(dev, skb->priority);
 		qoffset = dev->tc_to_txq[tc].offset;
 		qcount = dev->tc_to_txq[tc].count;
+		if (txq_hint >= qcount) /* This can happen when xps is configured on TC TXQs */
+			txq_hint = -1; /* recalculate TXQ hash */
 	}
 
-	return (u16) reciprocal_scale(skb_get_hash(skb), qcount) + qoffset;
+	hash = txq_hint < 0 ? reciprocal_scale(skb_get_hash(skb), qcount) : txq_hint;
+	return (u16)(hash) + qoffset;
 }
 EXPORT_SYMBOL(__skb_tx_hash);
 
@@ -3201,8 +3204,7 @@  static u16 __netdev_pick_tx(struct net_device *dev, struct sk_buff *skb)
 	if (queue_index < 0 || skb->ooo_okay ||
 	    queue_index >= dev->real_num_tx_queues) {
 		int new_index = get_xps_queue(dev, skb);
-		if (new_index < 0)
-			new_index = skb_tx_hash(dev, skb);
+		new_index = skb_tx_hash(dev, skb, new_index);
 
 		if (queue_index != new_index && sk &&
 		    sk_fullsock(sk) &&