[7/8] net: support tx_ring per UP in HW based QoS mechanism

Message ID 1331659323-12904-8-git-send-email-amirv@mellanox.com
State Changes Requested, archived
Delegated to: David Miller

Commit Message

Amir Vadai March 13, 2012, 5:22 p.m. UTC
From: Amir Vadai <amirv@mellanox.co.il>

The current HW based QoS mechanism, introduced in commit 4f57c087de9
"net: implement mechanism for HW based QOS", is oriented toward ETS traffic
classes. This patch allows the same mechanism to be used with hardware that
has queues per user priority (UP). After the change, __skb_tx_hash() directs
a flow to one tx ring out of a range of tx rings. The caller defines this
range according to the specific HW: for TC based queues the range is selected
by TC number, and for UP based queues it is selected by UP.

CC: John Fastabend <john.r.fastabend@intel.com>
CC: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
CC: Eilon Greenstein <eilong@broadcom.com>
Signed-off-by: Amir Vadai <amirv@mellanox.co.il>
---
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c |   11 ++++++++++-
 drivers/net/ethernet/mellanox/mlx4/en_tx.c      |    9 +++------
 include/linux/netdevice.h                       |   12 +++++++++++-
 include/linux/skbuff.h                          |    3 ++-
 net/core/dev.c                                  |   10 +---------
 5 files changed, 27 insertions(+), 18 deletions(-)

Comments

John Fastabend March 13, 2012, 6:23 p.m. UTC | #1
On 3/13/2012 10:22 AM, Amir Vadai wrote:
> From: Amir Vadai <amirv@mellanox.co.il>
> 
> The Current HW based QoS mechanism which was introduced in commit 4f57c087de9
> "net: implement mechanism for HW based QOS" is in orientation to ETS traffic
> class. This patch introduces an approach which allow to use this mechanism also
> with hardware who has queues per user priority (UP). After the change,
> __skb_tx_hash() will direct a flow to a tx ring from a range of tx rings. This
> range is defined by the caller function by the specific HW. If TC based queues,
> the range is by TC number and for UP based queues, the range is by UP.
> 

ETS is one specific use case for mqprio; it can easily be used with other
hardware transmission selection algorithms, 802.1Q standard based or otherwise.

The mapping is really just a mapping from skb->priority to a group of queues
(qoffset and qcount). I happened to call the queue grouping a traffic class
because that aligns with 802.1Q, but it _really_ is just a queue grouping.
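
As an illustration of the queue grouping described above, here is a minimal
user-space sketch, not kernel code. The names queue_group, qos_map and
pick_queue are hypothetical, and the modulo spread is a simplification that
stands in for whatever hash scaling __skb_tx_hash() actually applies.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_PRIO   16	/* assumption: priority is masked to 4 bits */
#define MAX_GROUPS 16	/* assumption: at most 16 queue groups */

/* One queue group: 'count' consecutive tx queues starting at 'offset'. */
struct queue_group {
	uint16_t offset;
	uint16_t count;
};

/* Hypothetical container for the priority map and the queue groups. */
struct qos_map {
	uint8_t prio_to_group[MAX_PRIO];
	struct queue_group group[MAX_GROUPS];
};

/*
 * The priority selects the queue group; the flow hash only spreads
 * flows across the queues inside that group.
 */
static uint16_t pick_queue(const struct qos_map *m, uint32_t priority,
			   uint32_t flow_hash)
{
	uint8_t grp = m->prio_to_group[priority & (MAX_PRIO - 1)];
	const struct queue_group *g;

	assert(grp < MAX_GROUPS);
	g = &m->group[grp];
	assert(g->count > 0);

	return (uint16_t)(g->offset + flow_hash % g->count);
}
```

Nothing in this sketch knows whether a group represents an ETS traffic class
or something else, which is the point being made: the group is only an
(offset, count) pair.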

In your case, what would it mean to change the map and num_tc? See 'tc':

[root@jf-dev1-dcblab netperf]# tc qdisc add dev eth3 root mqprio help
Usage: ... mqprio [num_tc NUMBER] [map P0 P1 ...]
                  [queues count1@offset1 count2@offset2 ...] [hw 1|0]

For example, setting 'num_tc 8 map 0 1 2 3 0 1 2 3' looks like it might
not work correctly. Would you end up with an skb tagged with priority
4,5,6,7 being sent out UP queues 0,1,2,3? My guess is that won't work
correctly with PFC or your ETS.
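
To make the question concrete, the map from that command can be written out
as a table. This is a small stand-alone sketch; class_for_priority and
map_is_identity are hypothetical helper names, not kernel functions.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_PRIO 8

/* The map from the example above: 'map 0 1 2 3 0 1 2 3'. */
static const uint8_t example_map[NUM_PRIO] = { 0, 1, 2, 3, 0, 1, 2, 3 };

/* Class (or, on UP-per-ring hardware, the UP ring group) for a priority. */
static uint8_t class_for_priority(const uint8_t map[NUM_PRIO], uint8_t priority)
{
	assert(priority < NUM_PRIO);
	return map[priority];
}

/* Returns 1 only if every priority maps to the class of the same number. */
static int map_is_identity(const uint8_t map[NUM_PRIO])
{
	for (int p = 0; p < NUM_PRIO; p++)
		if (map[p] != p)
			return 0;
	return 1;
}
```

With this table, priorities 4 to 7 land on classes 0 to 3, so a driver that
treats the class number as the user priority would no longer see a 1:1 map.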

In the canonical iSCSI case we put iscsid in a net_prio cgroup to get the
priority set, then use the priority to steer the skb to the correct queue
grouping. In your case I think you can just fail any num_tc != 8 and keep
the default map 1:1, and this should work. What did I miss? It looks like you
already fail the num_tc != 8 case, so why do we need this change?

At most maybe we need a flag to indicate the mqprio map can't be changed in
some cases.


> CC: John Fastabend <john.r.fastabend@intel.com>
> CC: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
> CC: Eilon Greenstein <eilong@broadcom.com>
> Signed-off-by: Amir Vadai <amirv@mellanox.co.il>
> ---
>  drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c |   11 ++++++++++-
>  drivers/net/ethernet/mellanox/mlx4/en_tx.c      |    9 +++------
>  include/linux/netdevice.h                       |   12 +++++++++++-
>  include/linux/skbuff.h                          |    3 ++-
>  net/core/dev.c                                  |   10 +---------
>  5 files changed, 27 insertions(+), 18 deletions(-)
> 

[...]

>  
>  void bnx2x_set_num_queues(struct bnx2x *bp)
> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
> index 7a49830..d0d96e3 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
> @@ -570,18 +570,15 @@ static void build_inline_wqe(struct mlx4_en_tx_desc *tx_desc, struct sk_buff *sk
>  
>  u16 mlx4_en_select_queue(struct net_device *dev, struct sk_buff *skb)
>  {
> -	struct mlx4_en_priv *priv = netdev_priv(dev);
> -	int up = -1;
> +	int up = 0;
>  
>  	if (vlan_tx_tag_present(skb))
>  		up = (vlan_tx_tag_get(skb) >> 13);

I was trying to avoid logic like this in select_queue().

Can we get the same behavior by keeping the egress map and mqprio
map in sync?

>  	else if (dev->num_tc)
>  		up = netdev_get_prio_tc_map(dev, skb->priority);
>  
> -	if (up >= 0)
> -		return MLX4_EN_NUM_TX_RINGS + up;
> -
> -	return __skb_tx_hash(dev, skb, MLX4_EN_NUM_TX_RINGS);
> +	return __skb_tx_hash(dev, skb, MLX4_EN_NUM_TX_RINGS,
> +			MLX4_EN_NUM_TX_RINGS + up, 1);
>  }
>  
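
For reference, the shift by 13 in the quoted hunk takes the top three bits
of the 16-bit VLAN TCI, which hold the priority (PCP). The stand-alone
sketch below shows that bit layout; the helper names are hypothetical, and
the two constants are defined locally, named after the kernel's
VLAN_PRIO_SHIFT and VLAN_VID_MASK.

```c
#include <assert.h>
#include <stdint.h>

/* TCI layout: 3 bits PCP, 1 bit DEI/CFI, 12 bits VLAN ID. */
#define VLAN_PRIO_SHIFT 13
#define VLAN_VID_MASK   0x0fff

/* User priority (PCP) carried in the tag. */
static uint8_t vlan_tci_to_pcp(uint16_t tci)
{
	return (uint8_t)(tci >> VLAN_PRIO_SHIFT);
}

/* VLAN ID carried in the tag. */
static uint16_t vlan_tci_to_vid(uint16_t tci)
{
	return (uint16_t)(tci & VLAN_VID_MASK);
}

/* Build a TCI with the DEI/CFI bit cleared. */
static uint16_t vlan_make_tci(uint8_t pcp, uint16_t vid)
{
	assert(pcp < 8);
	assert(vid <= VLAN_VID_MASK);
	return (uint16_t)((pcp << VLAN_PRIO_SHIFT) | vid);
}
```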

Thanks,
John
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Amir Vadai March 14, 2012, 10:09 a.m. UTC | #2
On 03/13/2012 08:23 PM, John Fastabend wrote:
> On 3/13/2012 10:22 AM, Amir Vadai wrote:
>> From: Amir Vadai<amirv@mellanox.co.il>
>>
>> The Current HW based QoS mechanism which was introduced in commit 4f57c087de9
>> "net: implement mechanism for HW based QOS" is in orientation to ETS traffic
>> class. This patch introduces an approach which allow to use this mechanism also
>> with hardware who has queues per user priority (UP). After the change,
>> __skb_tx_hash() will direct a flow to a tx ring from a range of tx rings. This
>> range is defined by the caller function by the specific HW. If TC based queues,
>> the range is by TC number and for UP based queues, the range is by UP.
>>
> ETS is one specific use case for mqprio it can easily be used with other
> hardware transmission selection algorithms 802.1Q std based or otherwise.
>
> The mapping is really just an skb->priority to group of queues (qoffset and
> qcount). I happened to call the queue grouping a traffic class because that
> aligns with 802.1Q but it _really_ is just a queue grouping.
This is good for untagged traffic, but for tagged traffic I can see 2 
problems with this approach:
1. The mqprio mapping could contradict the egress map or the 802.1Qaz 
mapping (UP <=> TC). This could be solved (not very elegantly) by forcing 
the mappings to be synced.
2. The egress map is per vlan, and the mqprio mapping is one global mapping.
>
> In your case what would it mean to change the map and num_tc see 'tc':
>
> [root@jf-dev1-dcblab netperf]# tc qdisc add dev eth3 root mqprio help
> Usage: ... mqprio [num_tc NUMBER] [map P0 P1 ...]
>                    [queues count1@offset1 count2@offset2 ...] [hw 1|0]
>
> For example setting 'num_tc 8 map 0 1 2 3 0 1 2 3' looks like it might
> not work correctly. Would you end up with an skb tagged with priority
> 4,5,6,7 being sent out UP queues 0,1,2,3? My quess is that won't work
> with PFC or your ETS correctly.
I don't see a problem here. For example, an skb tagged with priority 5 is 
mapped to UP 1 and sent through one of the tx rings of UP 1. All the 
rings of UP 1 share the same transmission queue (schedule queue), which 
the HW controls according to PFC and ETS. What is the problem here?
>
> In the canonical iSCSI case we put iscsid in a net_prio cgroup to get the
> priority set then use the priority to steer the skb to the correct queue
> groupings. In your case I think you can just fail any num_tc != 8 and keep
> the dflt map 1:1 then this should work. What did I miss? It looks like you
> already fail the num_tc != 8 case so why do we need this change?
>
> At most maybe we need a flag to indicate the mqprio map can't be changed in
> some cases.
What you suggest is that the priority in net_prio cgroup will be the 
User Priority, and not just the skb priority?
And also, for tagged traffic, how could it be forced to be synced with 
egress map?
>
>
>> CC: John Fastabend<john.r.fastabend@intel.com>
>> CC: Jeff Kirsher<jeffrey.t.kirsher@intel.com>
>> CC: Eilon Greenstein<eilong@broadcom.com>
>> Signed-off-by: Amir Vadai<amirv@mellanox.co.il>
>> ---
>>   drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c |   11 ++++++++++-
>>   drivers/net/ethernet/mellanox/mlx4/en_tx.c      |    9 +++------
>>   include/linux/netdevice.h                       |   12 +++++++++++-
>>   include/linux/skbuff.h                          |    3 ++-
>>   net/core/dev.c                                  |   10 +---------
>>   5 files changed, 27 insertions(+), 18 deletions(-)
>>
> [...]
>
>>
>>   void bnx2x_set_num_queues(struct bnx2x *bp)
>> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
>> index 7a49830..d0d96e3 100644
>> --- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
>> +++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
>> @@ -570,18 +570,15 @@ static void build_inline_wqe(struct mlx4_en_tx_desc *tx_desc, struct sk_buff *sk
>>
>>   u16 mlx4_en_select_queue(struct net_device *dev, struct sk_buff *skb)
>>   {
>> -	struct mlx4_en_priv *priv = netdev_priv(dev);
>> -	int up = -1;
>> +	int up = 0;
>>
>>   	if (vlan_tx_tag_present(skb))
>>   		up = (vlan_tx_tag_get(skb)>>  13);
> I was trying to avoid logic like this in select_queue().
Why?
>
> Can we get the same behavior by keeping the egress map and mqprio
> map in sync?
As I said above, if we force the egress map to be synced to the mqprio 
mapping, we lose its power - mqprio is global, and the egress map is per vlan.
>
>>   	else if (dev->num_tc)
>>   		up = netdev_get_prio_tc_map(dev, skb->priority);
>>
>> -	if (up>= 0)
>> -		return MLX4_EN_NUM_TX_RINGS + up;
>> -
>> -	return __skb_tx_hash(dev, skb, MLX4_EN_NUM_TX_RINGS);
>> +	return __skb_tx_hash(dev, skb, MLX4_EN_NUM_TX_RINGS,
>> +			MLX4_EN_NUM_TX_RINGS + up, 1);
>>   }
>>
> Thanks,
> John
Thanks,
Amir
John Fastabend March 14, 2012, 9:36 p.m. UTC | #3
On 3/14/2012 3:09 AM, Amir Vadai wrote:
> On 03/13/2012 08:23 PM, John Fastabend wrote:
>> On 3/13/2012 10:22 AM, Amir Vadai wrote:
>>> From: Amir Vadai<amirv@mellanox.co.il>
>>>
>>> The Current HW based QoS mechanism which was introduced in commit 4f57c087de9
>>> "net: implement mechanism for HW based QOS" is in orientation to ETS traffic
>>> class. This patch introduces an approach which allow to use this mechanism also
>>> with hardware who has queues per user priority (UP). After the change,
>>> __skb_tx_hash() will direct a flow to a tx ring from a range of tx rings. This
>>> range is defined by the caller function by the specific HW. If TC based queues,
>>> the range is by TC number and for UP based queues, the range is by UP.
>>>
>> ETS is one specific use case for mqprio it can easily be used with other
>> hardware transmission selection algorithms 802.1Q std based or otherwise.
>>
>> The mapping is really just an skb->priority to group of queues (qoffset and
>> qcount). I happened to call the queue grouping a traffic class because that
>> aligns with 802.1Q but it _really_ is just a queue grouping.

> This is good for untagged traffic, but for tagged traffic I can see 2 problems with this approach:
> 1. mqprio mapping could contradict egress map or 8021Qaz mapping (UP
> <=> TC). This could be solved (not very elegantly) by forcing the
> mappings to be synced. 

OK. We've just been keeping them in-sync.

> 2. egress map is per vlan, and mqprio mapping
> is one global mapping.

So it only matters when you want the egress map per vlan? The problem
I see with this now is that the Mellanox driver does DCB differently than
the existing drivers.

For example, if I put a task in a net_prio cgroup and assign the vlan
a priority, this won't actually steer the packet correctly on mlx. I
also have to create an egress map. Existing drivers will ignore the
egress_map and steer the skb as they always have.

At minimum skbs need to be steered the same on all drivers. We can't
expect user space to "know" what hardware is underneath.

>>
>> In your case what would it mean to change the map and num_tc see 'tc':
>>
>> [root@jf-dev1-dcblab netperf]# tc qdisc add dev eth3 root mqprio help
>> Usage: ... mqprio [num_tc NUMBER] [map P0 P1 ...]
>>                    [queues count1@offset1 count2@offset2 ...] [hw 1|0]
>>
>> For example setting 'num_tc 8 map 0 1 2 3 0 1 2 3' looks like it might
>> not work correctly. Would you end up with an skb tagged with priority
>> 4,5,6,7 being sent out UP queues 0,1,2,3? My quess is that won't work
>> with PFC or your ETS correctly.
> I don't see a problem here. For example, skb tagged with priority 5
> is mapped to UP 1. And sent through one of the tx rings of UP 1. All
> the rings of UP 1 share the same transmission queue (schedule queue)
> which is controlled by PFC and ETS by the HW. What is the problem
> here?

I was concerned about the actual tag that gets added. In ixgbe we've been
adding a tag based on skb->priority in the untagged pkt case. In your
driver, after looking at the code, either you're not adding a tag or the
hardware is adding the correct user priority to the priority tagged pkts.

We use the skb->priority in ixgbe because we can have multiple user
priorities (PCPs) on a single tx_ring with the above map. We have
no other way to know what the priority should be in the untagged
case.

>>
>> In the canonical iSCSI case we put iscsid in a net_prio cgroup to get the
>> priority set then use the priority to steer the skb to the correct queue
>> groupings. In your case I think you can just fail any num_tc != 8 and keep
>> the dflt map 1:1 then this should work. What did I miss? It looks like you
>> already fail the num_tc != 8 case so why do we need this change?
>>
>> At most maybe we need a flag to indicate the mqprio map can't be changed in
>> some cases.
> What you suggest is that the priority in net_prio cgroup will be the User Priority, and not just the skb priority?

That is how we are using it today, yes. Which creates the somewhat
unfortunate case (I guess) that the egress map has to be aligned
with the qdisc map. This hasn't caused any problems in practice for us.

> And also, for tagged traffic, how could it be forced to be synced with egress map?

There is a priority in the net_prio.ifpriomap group for each vlan as well as
the real device, so we just set up the mapping for the vlan.

This is how we do things like assign a vlan a default priority.


>> [...]
>>
>>>
>>>   void bnx2x_set_num_queues(struct bnx2x *bp)
>>> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
>>> index 7a49830..d0d96e3 100644
>>> --- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
>>> +++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
>>> @@ -570,18 +570,15 @@ static void build_inline_wqe(struct mlx4_en_tx_desc *tx_desc, struct sk_buff *sk
>>>
>>>   u16 mlx4_en_select_queue(struct net_device *dev, struct sk_buff *skb)
>>>   {
>>> -    struct mlx4_en_priv *priv = netdev_priv(dev);
>>> -    int up = -1;
>>> +    int up = 0;
>>>
>>>       if (vlan_tx_tag_present(skb))
>>>           up = (vlan_tx_tag_get(skb)>>  13);
>> I was trying to avoid logic like this in select_queue().
> Why?

Because this makes your driver potentially behave differently than
other drivers. DCB should look the same from the user side
regardless of the driver.

>>
>> Can we get the same behavior by keeping the egress map and mqprio
>> map in sync?
> As I said above, if we force egress map to be synced to mqprio mapping, we loose it's power - mqprio is global, and egress map is per vlan.

Using net_prio cgroups per vlan allows per vlan priority mappings.
I agree this is a bit awkward right now, and it seems reasonable
to expect that setting the egress_map causes the skb steering to work
correctly.

The crux of the issue here is that ixgbe and bnx2x are modeling
the qdisc tc as a traffic class but your hardware is based on
a model that exposes user priorities. We need these to look the
same from the user perspective. We need to figure out how to
make this correct for both models. Any suggestions?

.John
Amir Vadai March 15, 2012, 10:05 a.m. UTC | #4
On 03/14/2012 11:36 PM, John Fastabend wrote:
> On 3/14/2012 3:09 AM, Amir Vadai wrote:
>> On 03/13/2012 08:23 PM, John Fastabend wrote:
>>> On 3/13/2012 10:22 AM, Amir Vadai wrote:
>>>> From: Amir Vadai<amirv@mellanox.co.il>
>>>>
>>>> The Current HW based QoS mechanism which was introduced in commit 4f57c087de9
>>>> "net: implement mechanism for HW based QOS" is in orientation to ETS traffic
>>>> class. This patch introduces an approach which allow to use this mechanism also
>>>> with hardware who has queues per user priority (UP). After the change,
>>>> __skb_tx_hash() will direct a flow to a tx ring from a range of tx rings. This
>>>> range is defined by the caller function by the specific HW. If TC based queues,
>>>> the range is by TC number and for UP based queues, the range is by UP.
>>>>
>>> ETS is one specific use case for mqprio it can easily be used with other
>>> hardware transmission selection algorithms 802.1Q std based or otherwise.
>>>
>>> The mapping is really just an skb->priority to group of queues (qoffset and
>>> qcount). I happened to call the queue grouping a traffic class because that
>>> aligns with 802.1Q but it _really_ is just a queue grouping.
>
>> This is good for untagged traffic, but for tagged traffic I can see 2 problems with this approach:
>> 1. mqprio mapping could contradict egress map or 8021Qaz mapping (UP
>> <=>  TC). This could be solved (not very elegantly) by forcing the
>> mappings to be synced.
>
> OK. We've just been keeping them in-sync.
>
>> 2. egress map is per vlan, and mqprio mapping
>> is one global mapping.
>
> So it only matters when you want the egress map per vlan? The problem
> I see with this now is the mellanox driver does DCB differently then
> the existing drivers.
>
> For example if I put a task in a net_prio cgroup and assign the vlan
> a priority this won't actually steer the packet correctly on mlx. I
> also have to create an egress map existing drivers will ignore the
> egress_map and steer the skb as they always have.
But if you don't create an egress map for tagged traffic, what will be 
in the PCP field of the vlan tag (= User Priority)?
>
> At minimum skbs need to be steered the same on all drivers. We can't
> expect user space to "know" what hardware is underneath.
>
>>>
>>> In your case what would it mean to change the map and num_tc see 'tc':
>>>
>>> [root@jf-dev1-dcblab netperf]# tc qdisc add dev eth3 root mqprio help
>>> Usage: ... mqprio [num_tc NUMBER] [map P0 P1 ...]
>>>                     [queues count1@offset1 count2@offset2 ...] [hw 1|0]
>>>
>>> For example setting 'num_tc 8 map 0 1 2 3 0 1 2 3' looks like it might
>>> not work correctly. Would you end up with an skb tagged with priority
>>> 4,5,6,7 being sent out UP queues 0,1,2,3? My quess is that won't work
>>> with PFC or your ETS correctly.
>> I don't see a problem here. For example, skb tagged with priority 5
>> is mapped to UP 1. And sent through one of the tx rings of UP 1. All
>> the rings of UP 1 share the same transmission queue (schedule queue)
>> which is controlled by PFC and ETS by the HW. What is the problem
>> here?
>
> I was concerned about the actual tag that gets added. In ixgbe we've been
> adding a tag based on skb->priority in the untagged pkt case. In your
> driver after looking at the code either your not adding a tag or the
> hardware is adding the correct user priority to the priority tagged pkts.
It is added by the HW according to the tx_ring.
>
> We use the skb->priority in ixgbe because we can have multiple user
> priorities (PCPs) on a single tx_ring with the above map. We have
> no other way to know what the priority should be in the untagged
> case.
Instead of attaching a tx_ring to an ETS TC like your driver does, in our 
driver a tx_ring is attached to a single user priority (UP).
With this UP and the UP <=> TC mapping configured in the HW through DCB 
netlink, the HW can enforce the 802.1Qaz attributes of the mapped TC.
For tagged traffic this UP is also used in the PCP field of the vlan tag.
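
The ring-per-UP model described above can be sketched as follows. This is a
user-space illustration only: the layout (per-UP rings placed after the
hashed rings) is taken from the 'MLX4_EN_NUM_TX_RINGS + up' expression in
the quoted hunk, while the value 8 for NUM_HASH_RINGS, the helper names and
the sample UP <=> TC table are assumptions.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_HASH_RINGS 8	/* assumed stand-in for MLX4_EN_NUM_TX_RINGS */
#define NUM_UP         8

/* Ring dedicated to a user priority: per-UP rings follow the hashed rings. */
static uint16_t ring_for_up(uint8_t up)
{
	assert(up < NUM_UP);
	return (uint16_t)(NUM_HASH_RINGS + up);
}

/* Inverse lookup: the UP that a per-UP ring stands for. */
static uint8_t up_for_ring(uint16_t ring)
{
	assert(ring >= NUM_HASH_RINGS && ring < NUM_HASH_RINGS + NUM_UP);
	return (uint8_t)(ring - NUM_HASH_RINGS);
}

/* TC whose attributes apply to a ring, via the UP <=> TC table. */
static uint8_t tc_for_ring(const uint8_t up_to_tc[NUM_UP], uint16_t ring)
{
	return up_to_tc[up_for_ring(ring)];
}
```

In this model the ring alone determines the UP, so no per-packet priority
has to be passed down for the PCP field to be filled in.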
>
>>>
>>> In the canonical iSCSI case we put iscsid in a net_prio cgroup to get the
>>> priority set then use the priority to steer the skb to the correct queue
>>> groupings. In your case I think you can just fail any num_tc != 8 and keep
>>> the dflt map 1:1 then this should work. What did I miss? It looks like you
>>> already fail the num_tc != 8 case so why do we need this change?
>>>
>>> At most maybe we need a flag to indicate the mqprio map can't be changed in
>>> some cases.
>> What you suggest is that the priority in net_prio cgroup will be the User Priority, and not just the skb priority?
>
> That is how we are using it today yes. Which creates the some what
> unfortunate case (I guess) that the egress map has to be aligned
> with the qdisc map. This hasn't caused any problems in practice for us.
>
>> And also, for tagged traffic, how could it be forced to be synced with egress map?
>
> there is a priority in net_prio.ifpriomap group for each vlan as well as
> real device so we just setup the mapping for the vlan.
>
> This is how we do things like assign a vlan a default priority.
>
>
>>> [...]
>>>
>>>>
>>>>    void bnx2x_set_num_queues(struct bnx2x *bp)
>>>> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
>>>> index 7a49830..d0d96e3 100644
>>>> --- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
>>>> +++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
>>>> @@ -570,18 +570,15 @@ static void build_inline_wqe(struct mlx4_en_tx_desc *tx_desc, struct sk_buff *sk
>>>>
>>>>    u16 mlx4_en_select_queue(struct net_device *dev, struct sk_buff *skb)
>>>>    {
>>>> -    struct mlx4_en_priv *priv = netdev_priv(dev);
>>>> -    int up = -1;
>>>> +    int up = 0;
>>>>
>>>>        if (vlan_tx_tag_present(skb))
>>>>            up = (vlan_tx_tag_get(skb)>>   13);
>>> I was trying to avoid logic like this in select_queue().
>> Why?
>
> Because this makes your driver potentially behave differently then
> other drivers. DCB should look the same from the user side
> regardless of the driver.
I agree - it should look the same.
>
>>>
>>> Can we get the same behavior by keeping the egress map and mqprio
>>> map in sync?
>> As I said above, if we force egress map to be synced to mqprio mapping, we loose it's power - mqprio is global, and egress map is per vlan.
>
> using net_prio cgroups per vlan allows per vlan priority mappings.
> I agree this is a bit awkward right now and it seems reasonable
> to expect setting the egress_map causes the skb steering to work
> correctly.
>
> The crux of the issue here is that ixgbe and bnx2x are modeling
> the qdisc tc as a traffic class but your hardware is based on
> a model that exposes user priorities. We need these to look the
> same from the user perspective. We need to figure out how to
> make this correct for both models. Any suggestions?
Before suggesting, I need to make sure I understand the current model:

Assumptions
-----------
a. If tagged traffic is involved, egress map is configured 1:1 and
    therefore, skb priority = User Priority (UP)
b. mqprio traffic class is ETS TC

Mappings in use
---------------
1. net_prio cgroup: netdev + task <=> skb priority
2. SO_PRIORITY/IP_TOS: skb priority
3. mqprio: skb priority <=> traffic class
4. DCB netlink: UP <=> ETS TC

Untagged traffic
----------------
User is using [1] or [2] to tag a flow with priority.
Driver is using [3] to steer traffic according to 802.1Qaz ETS attributes.

Tagged traffic
--------------
User is using [1] or [2] to tag a flow with priority
Driver is setting PCP bits in vlan header using skb priority (1:1 in 
egress map).
Traffic is steered using [3].

Mapping [4] must be synced with [3].
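
The sync requirement between [3] and [4] can be expressed as a simple check.
This is a stand-alone sketch under assumption (a) above, that skb priority
equals UP; first_map_mismatch is a hypothetical helper, not an existing
kernel or DCB netlink function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_UP 8

/*
 * Compare the mqprio map [3] (skb priority -> traffic class) with the
 * DCB netlink map [4] (UP -> ETS TC), treating skb priority as the UP.
 * Returns the first priority where they disagree, or -1 if in sync.
 */
static int first_map_mismatch(const uint8_t prio_to_tc[NUM_UP],
			      const uint8_t up_to_tc[NUM_UP])
{
	assert(prio_to_tc != NULL && up_to_tc != NULL);

	for (int p = 0; p < NUM_UP; p++)
		if (prio_to_tc[p] != up_to_tc[p])
			return p;

	return -1;
}
```

A user-space tool that manages both maps could run a check like this after
each change and refuse or repair a configuration that is out of sync.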

>
> .John


- Amir
John Fastabend March 16, 2012, 7:16 a.m. UTC | #5
On 3/15/2012 3:05 AM, Amir Vadai wrote:
> On 03/14/2012 11:36 PM, John Fastabend wrote:
>> On 3/14/2012 3:09 AM, Amir Vadai wrote:
>>> On 03/13/2012 08:23 PM, John Fastabend wrote:
>>>> On 3/13/2012 10:22 AM, Amir Vadai wrote:
>>>>> From: Amir Vadai<amirv@mellanox.co.il>
>>>>>
>>>>> The Current HW based QoS mechanism which was introduced in commit 4f57c087de9
>>>>> "net: implement mechanism for HW based QOS" is in orientation to ETS traffic
>>>>> class. This patch introduces an approach which allow to use this mechanism also
>>>>> with hardware who has queues per user priority (UP). After the change,
>>>>> __skb_tx_hash() will direct a flow to a tx ring from a range of tx rings. This
>>>>> range is defined by the caller function by the specific HW. If TC based queues,
>>>>> the range is by TC number and for UP based queues, the range is by UP.
>>>>>
>>>> ETS is one specific use case for mqprio it can easily be used with other
>>>> hardware transmission selection algorithms 802.1Q std based or otherwise.
>>>>
>>>> The mapping is really just an skb->priority to group of queues (qoffset and
>>>> qcount). I happened to call the queue grouping a traffic class because that
>>>> aligns with 802.1Q but it _really_ is just a queue grouping.
>>
>>> This is good for untagged traffic, but for tagged traffic I can see 2 problems with this approach:
>>> 1. mqprio mapping could contradict egress map or 8021Qaz mapping (UP
>>> <=>  TC). This could be solved (not very elegantly) by forcing the
>>> mappings to be synced.
>>
>> OK. We've just been keeping them in-sync.
>>
>>> 2. egress map is per vlan, and mqprio mapping
>>> is one global mapping.
>>
>> So it only matters when you want the egress map per vlan? The problem
>> I see with this now is the mellanox driver does DCB differently then
>> the existing drivers.
>>
>> For example if I put a task in a net_prio cgroup and assign the vlan
>> a priority this won't actually steer the packet correctly on mlx. I
>> also have to create an egress map existing drivers will ignore the
>> egress_map and steer the skb as they always have.
> But if you don't create an egress map for tagged traffic. What will be in the PCP field of the vlan tag (= User Priority)?
>>
>> At minimum skbs need to be steered the same on all drivers. We can't
>> expect user space to "know" what hardware is underneath.
>>
>>>>
>>>> In your case what would it mean to change the map and num_tc see 'tc':
>>>>
>>>> [root@jf-dev1-dcblab netperf]# tc qdisc add dev eth3 root mqprio help
>>>> Usage: ... mqprio [num_tc NUMBER] [map P0 P1 ...]
>>>>                     [queues count1@offset1 count2@offset2 ...] [hw 1|0]
>>>>
>>>> For example setting 'num_tc 8 map 0 1 2 3 0 1 2 3' looks like it might
>>>> not work correctly. Would you end up with an skb tagged with priority
>>>> 4,5,6,7 being sent out UP queues 0,1,2,3? My quess is that won't work
>>>> with PFC or your ETS correctly.
>>> I don't see a problem here. For example, skb tagged with priority 5
>>> is mapped to UP 1. And sent through one of the tx rings of UP 1. All
>>> the rings of UP 1 share the same transmission queue (schedule queue)
>>> which is controlled by PFC and ETS by the HW. What is the problem
>>> here?
>>
>> I was concerned about the actual tag that gets added. In ixgbe we've been
>> adding a tag based on skb->priority in the untagged pkt case. In your
>> driver after looking at the code either your not adding a tag or the
>> hardware is adding the correct user priority to the priority tagged pkts.
> It is added by the HW according to the tx_ring.
>>
>> We use the skb->priority in ixgbe because we can have multiple user
>> priorities (PCPs) on a single tx_ring with the above map. We have
>> no other way to know what the priority should be in the untagged
>> case.
> Instead of attaching a tx_ring to ETS TC like your driver does, in our driver a tx_ring is attached to a single user priority (UP).
> With this UP and the mapping UP <=> TC configured by DCB netlink to the HW, the HW can enforce the 8021Qaz attributes by the mapped TC.
> For tagged traffic this UP is also used in the PCP field in the vlan tag.
>>
>>>>
>>>> In the canonical iSCSI case we put iscsid in a net_prio cgroup to get the
>>>> priority set then use the priority to steer the skb to the correct queue
>>>> groupings. In your case I think you can just fail any num_tc != 8 and keep
>>>> the dflt map 1:1 then this should work. What did I miss? It looks like you
>>>> already fail the num_tc != 8 case so why do we need this change?
>>>>
>>>> At most maybe we need a flag to indicate the mqprio map can't be changed in
>>>> some cases.
>>> What you suggest is that the priority in net_prio cgroup will be the User Priority, and not just the skb priority?
>>
>> That is how we are using it today yes. Which creates the some what
>> unfortunate case (I guess) that the egress map has to be aligned
>> with the qdisc map. This hasn't caused any problems in practice for us.
>>
>>> And also, for tagged traffic, how could it be forced to be synced with egress map?
>>
>> there is a priority in net_prio.ifpriomap group for each vlan as well as
>> real device so we just setup the mapping for the vlan.
>>
>> This is how we do things like assign a vlan a default priority.
>>
>>
>>>> [...]
>>>>
>>>>>
>>>>>    void bnx2x_set_num_queues(struct bnx2x *bp)
>>>>> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
>>>>> index 7a49830..d0d96e3 100644
>>>>> --- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
>>>>> +++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
>>>>> @@ -570,18 +570,15 @@ static void build_inline_wqe(struct mlx4_en_tx_desc *tx_desc, struct sk_buff *sk
>>>>>
>>>>>    u16 mlx4_en_select_queue(struct net_device *dev, struct sk_buff *skb)
>>>>>    {
>>>>> -    struct mlx4_en_priv *priv = netdev_priv(dev);
>>>>> -    int up = -1;
>>>>> +    int up = 0;
>>>>>
>>>>>        if (vlan_tx_tag_present(skb))
>>>>>            up = (vlan_tx_tag_get(skb)>>   13);
>>>> I was trying to avoid logic like this in select_queue().
>>> Why?
>>
>> Because this makes your driver potentially behave differently then
>> other drivers. DCB should look the same from the user side
>> regardless of the driver.
> I agree - it should look the same.
>>
>>>>
>>>> Can we get the same behavior by keeping the egress map and mqprio
>>>> map in sync?
>>> As I said above, if we force egress map to be synced to mqprio mapping, we loose it's power - mqprio is global, and egress map is per vlan.
>>
>> using net_prio cgroups per vlan allows per vlan priority mappings.
>> I agree this is a bit awkward right now and it seems reasonable
>> to expect setting the egress_map causes the skb steering to work
>> correctly.
>>
>> The crux of the issue here is that ixgbe and bnx2x are modeling
>> the qdisc tc as a traffic class but your hardware is based on
>> a model that exposes user priorities. We need these to look the
>> same from the user perspective. We need to figure out how to
>> make this correct for both models. Any suggestions?
> Before suggesting, I need to make sure I understand the current model:
> 
> Assumptions
> -----------
> a. If tagged traffic is involved, egress map is configured 1:1 and
>    therefore, skb priority = User Priority (UP)

Right, this is how we currently do this.

> b. mqprio traffic class is ETS TC

For IEEE 802.1Qaz, yes, this is the model.

> 
> Mappings in use
> ---------------
> 1. net_prio cgroup: netdev + task <=> skb priority
> 2. SO_PRIORITY/IP_TOS: skb priority
> 3. mqprio: skb priority <=> traffic class
> 4. DCB netlink: UP <=> ETS TC
> 

Yep, with DCB this is how we currently do the mappings.

> Untagged traffic
> ----------------
> User is using [1] or [2] to tag a flow with priority.
> Driver is using [3] to steer traffic according to 802.1Qaz ETS attributes.
> 

Correct.

> Tagged traffic
> --------------
> User is using [1] or [2] to tag a flow with priority
> Driver is setting PCP bits in vlan header using skb priority (1:1 in egress map).
> Traffic is steered using [3].
> 
> Mapping [4] must be synced with [3].
> 

Correct, this is how it currently works. It might be worth thinking
about how to get the egress map to work correctly in these cases,
although right now user space can keep these in sync and manage
them.
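
To make the two models concrete, here is a rough user-space sketch of
the steering described above. It is not the kernel code: struct
fake_dev, select_queue_tc(), select_queue_up() and the base_rings
parameter are made-up names for illustration. Only the prio -> tc ->
(offset, count) lookup, the ">> 13" PCP extraction and the scaling of
a 32-bit hash into a queue range are taken from the patch and from
__skb_tx_hash().

```c
#include <assert.h>
#include <stdint.h>

#define MAX_PRIO 16	/* skb priority is masked to 4 bits for the tc lookup */
#define MAX_TC   8

/* Hypothetical stand-in for dev->tc_to_txq[]: one queue range per tc. */
struct txq_range {
	uint16_t offset;
	uint16_t count;
};

/* Hypothetical stand-in for the few net_device fields used here. */
struct fake_dev {
	uint8_t num_tc;
	uint8_t prio_tc_map[MAX_PRIO];		/* mapping [3]: prio -> tc */
	struct txq_range tc_to_txq[MAX_TC];
	uint16_t real_num_tx_queues;
};

/* Scale a 32-bit hash into [qoffset, qoffset + qcount). */
static uint16_t pick_queue(uint32_t hash, uint16_t qoffset, uint16_t qcount)
{
	assert(qcount > 0);
	return (uint16_t)(((uint64_t)hash * qcount) >> 32) + qoffset;
}

/* TC model (ixgbe/bnx2x style): the range comes from the traffic class. */
static uint16_t select_queue_tc(const struct fake_dev *dev,
				uint32_t prio, uint32_t hash)
{
	uint16_t qoffset = 0;
	uint16_t qcount = dev->real_num_tx_queues;

	if (dev->num_tc) {
		uint8_t tc = dev->prio_tc_map[prio & (MAX_PRIO - 1)];

		qoffset = dev->tc_to_txq[tc].offset;
		qcount = dev->tc_to_txq[tc].count;
	}
	return pick_queue(hash, qoffset, qcount);
}

/* UP model (mlx4 style in this patch): one ring per user priority,
 * placed after base_rings regular rings. The PCP bits are the top
 * three bits of the 16-bit VLAN TCI.
 */
static uint16_t select_queue_up(uint16_t vlan_tci, uint16_t base_rings,
				uint32_t hash)
{
	uint16_t up = vlan_tci >> 13;

	return pick_queue(hash, base_rings + up, 1);
}
```

With a range of count 1 the hash has no effect on the result, which is
why the UP variant always lands on ring base_rings + up.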

>>
>> .John
> 
> 
> - Amir


Patch

diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
index c11e50d..614d0b2 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
@@ -1422,6 +1422,8 @@  void bnx2x_netif_stop(struct bnx2x *bp, int disable_hw)
 u16 bnx2x_select_queue(struct net_device *dev, struct sk_buff *skb)
 {
 	struct bnx2x *bp = netdev_priv(dev);
+	u16 qoffset = 0;
+	u16 qcount = BNX2X_NUM_ETH_QUEUES(bp);
 
 #ifdef BCM_CNIC
 	if (!NO_FCOE(bp)) {
@@ -1441,8 +1443,15 @@  u16 bnx2x_select_queue(struct net_device *dev, struct sk_buff *skb)
 			return bnx2x_fcoe_tx(bp, txq_index);
 	}
 #endif
+	if (dev->num_tc) {
+		u8 tc = netdev_get_prio_tc_map(dev, skb->priority);
+		qoffset = dev->tc_to_txq[tc].offset;
+		qcount = dev->tc_to_txq[tc].count;
+	}
+
 	/* select a non-FCoE queue */
-	return __skb_tx_hash(dev, skb, BNX2X_NUM_ETH_QUEUES(bp));
+	return __skb_tx_hash(dev, skb, BNX2X_NUM_ETH_QUEUES(bp), qoffset,
+			qcount);
 }
 
 void bnx2x_set_num_queues(struct bnx2x *bp)
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
index 7a49830..d0d96e3 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
@@ -570,18 +570,15 @@  static void build_inline_wqe(struct mlx4_en_tx_desc *tx_desc, struct sk_buff *sk
 
 u16 mlx4_en_select_queue(struct net_device *dev, struct sk_buff *skb)
 {
-	struct mlx4_en_priv *priv = netdev_priv(dev);
-	int up = -1;
+	int up = 0;
 
 	if (vlan_tx_tag_present(skb))
 		up = (vlan_tx_tag_get(skb) >> 13);
 	else if (dev->num_tc)
 		up = netdev_get_prio_tc_map(dev, skb->priority);
 
-	if (up >= 0)
-		return MLX4_EN_NUM_TX_RINGS + up;
-
-	return __skb_tx_hash(dev, skb, MLX4_EN_NUM_TX_RINGS);
+	return __skb_tx_hash(dev, skb, MLX4_EN_NUM_TX_RINGS,
+			MLX4_EN_NUM_TX_RINGS + up, 1);
 }
 
 static void mlx4_bf_copy(void __iomem *dst, unsigned long *src, unsigned bytecnt)
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 4535a4e..952dde3 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2061,7 +2061,17 @@  static inline void netif_wake_subqueue(struct net_device *dev, u16 queue_index)
 static inline u16 skb_tx_hash(const struct net_device *dev,
 			      const struct sk_buff *skb)
 {
-	return __skb_tx_hash(dev, skb, dev->real_num_tx_queues);
+	u16 qoffset = 0;
+	u16 qcount = dev->real_num_tx_queues;
+
+	if (dev->num_tc) {
+		u8 tc = netdev_get_prio_tc_map(dev, skb->priority);
+		qoffset = dev->tc_to_txq[tc].offset;
+		qcount = dev->tc_to_txq[tc].count;
+	}
+
+	return __skb_tx_hash(dev, skb, dev->real_num_tx_queues, qoffset,
+			qcount);
 }
 
 /**
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 8dc8257..14fa201 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2455,7 +2455,8 @@  static inline bool skb_rx_queue_recorded(const struct sk_buff *skb)
 
 extern u16 __skb_tx_hash(const struct net_device *dev,
 			 const struct sk_buff *skb,
-			 unsigned int num_tx_queues);
+			 unsigned int num_tx_queues,
+			 u16 qoffset, u16 qcount);
 
 #ifdef CONFIG_XFRM
 static inline struct sec_path *skb_sec_path(struct sk_buff *skb)
diff --git a/net/core/dev.c b/net/core/dev.c
index 0090809..ecbf5c1 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2290,11 +2290,9 @@  static u32 hashrnd __read_mostly;
  * to be used as a distribution range.
  */
 u16 __skb_tx_hash(const struct net_device *dev, const struct sk_buff *skb,
-		  unsigned int num_tx_queues)
+		  unsigned int num_tx_queues, u16 qoffset, u16 qcount)
 {
 	u32 hash;
-	u16 qoffset = 0;
-	u16 qcount = num_tx_queues;
 
 	if (skb_rx_queue_recorded(skb)) {
 		hash = skb_get_rx_queue(skb);
@@ -2303,12 +2301,6 @@  u16 __skb_tx_hash(const struct net_device *dev, const struct sk_buff *skb,
 		return hash;
 	}
 
-	if (dev->num_tc) {
-		u8 tc = netdev_get_prio_tc_map(dev, skb->priority);
-		qoffset = dev->tc_to_txq[tc].offset;
-		qcount = dev->tc_to_txq[tc].count;
-	}
-
 	if (skb->sk && skb->sk->sk_hash)
 		hash = skb->sk->sk_hash;
 	else