[net-next,1/6] net/dcb: Add dcbnl buffer attribute

Message ID 20180521210502.11082-2-saeedm@mellanox.com
State Changes Requested, archived
Delegated to: David Miller
Series [net-next,1/6] net/dcb: Add dcbnl buffer attribute

Commit Message

Saeed Mahameed May 21, 2018, 9:04 p.m. UTC
From: Huy Nguyen <huyn@mellanox.com>

In this patch, we add a dcbnl buffer attribute that allows the user to
change the NIC's buffer configuration, such as the priority-to-buffer
mapping and the size of each individual buffer.

This attribute, combined with the pfc attribute, allows advanced users to
fine-tune the QoS settings for a specific priority queue. For example, a
user can dedicate a buffer to one or more priorities, or give a larger
buffer to certain priorities.

We present a use case in which the dcbnl buffer attribute, configured
by an advanced user, helps reduce the latency of messages of different sizes.

Scenario description:
On ConnectX-5, we run latency-sensitive traffic with
small/medium message sizes ranging from 64B to 256KB and bandwidth-sensitive
traffic with large message sizes of 512KB and 1MB. We group the small, medium,
and large message sizes into their own PFC-enabled priorities as follows.
  Priorities 1 & 2 (64B, 256B and 1KB)
  Priorities 3 & 4 (4KB, 8KB, 16KB, 64KB, 128KB and 256KB)
  Priorities 5 & 6 (512KB and 1MB)

By default, ConnectX-5 maps all PFC-enabled priorities to a single
lossless buffer with a fixed size of 50% of the total available buffer
space. The other 50% is assigned to the lossy buffer. Using the dcbnl
buffer attribute, we create three equal-size lossless buffers, each with
25% of the total available buffer space. Thus, the lossy buffer shrinks
to 25%. The priority-to-lossless-buffer mappings are set as follows.
  Priorities 1 & 2 on lossless buffer #1
  Priorities 3 & 4 on lossless buffer #2
  Priorities 5 & 6 on lossless buffer #3

We observe latency improvements for small and medium message sizes
as follows. Please note that the bandwidth for the large message sizes is
reduced, but the total bandwidth remains the same.
  256B message size (42% latency reduction)
  4K message size (21% latency reduction)
  64K message size (16% latency reduction)
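
The split described in this scenario can be checked with a few lines of arithmetic. This is only an illustrative sketch, not driver logic: `TOTAL_BYTES` is a hypothetical total (the patch does not state the ConnectX-5 buffer size), and the function and variable names are made up for the example.

```python
# Hypothetical total buffer size in bytes; the real value is device specific.
TOTAL_BYTES = 400_000

def split(total, lossless_shares, lossy_share):
    """Return (lossless_sizes, lossy_size) for the given percentage shares."""
    if sum(lossless_shares) + lossy_share != 100:
        raise ValueError("shares must add up to 100%")
    lossless = [total * share // 100 for share in lossless_shares]
    return lossless, total * lossy_share // 100

# Default: one lossless buffer (50%) shared by all PFC-enabled priorities.
default_lossless, default_lossy = split(TOTAL_BYTES, [50], 50)

# Tuned: three equal lossless buffers (25% each); lossy shrinks to 25%.
tuned_lossless, tuned_lossy = split(TOTAL_BYTES, [25, 25, 25], 25)

# Priority -> lossless buffer number, as listed in the scenario.
prio_to_buffer = {1: 1, 2: 1, 3: 2, 4: 2, 5: 3, 6: 3}
```

In both layouts the pieces add up to the same total; only the way the lossless half is carved changes.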

Signed-off-by: Huy Nguyen <huyn@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 include/net/dcbnl.h        |  4 ++++
 include/uapi/linux/dcbnl.h | 10 ++++++++++
 net/dcb/dcbnl.c            | 20 ++++++++++++++++++++
 3 files changed, 34 insertions(+)

Comments

Jakub Kicinski May 22, 2018, 5:20 a.m. UTC | #1
On a cursory look this bears a lot of resemblance to the devlink shared
buffer configuration ABI.  Did you look into using that?

Just to be clear devlink shared buffer ABIs don't require representors
and "switchdev mode".
Huy Nguyen May 22, 2018, 3:36 p.m. UTC | #2
On 5/22/2018 12:20 AM, Jakub Kicinski wrote:
> On a cursory look this bears a lot of resemblance to devlink shared
> buffer configuration ABI.  Did you look into using that?
>
> Just to be clear devlink shared buffer ABIs don't require representors
> and "switchdev mode".
[HQN] Dear Jakub, there are several reasons why the devlink shared buffer
ABI cannot be used:
1. The devlink shared buffer ABI is modeled on the switch CLI; you can
   find out more at https://community.mellanox.com/docs/DOC-2558.
2. The dcbnl interfaces are already used for QoS settings. In a NIC, the
   buffer configuration is tied to priority (ETS, PFC); it is not tied to
   a port as in a switch.
3. Shared buffer, alpha, and threshold are switch-specific terms.

Please let me know if you have any further questions.
Regards,
Huy Nguyen
Jakub Kicinski May 22, 2018, 6:32 p.m. UTC | #3
On Tue, 22 May 2018 10:36:17 -0500, Huy Nguyen wrote:
> On 5/22/2018 12:20 AM, Jakub Kicinski wrote:
> > On a cursory look this bears a lot of resemblance to devlink shared
> > buffer configuration ABI.  Did you look into using that?
> >
> > Just to be clear devlink shared buffer ABIs don't require representors
> > and "switchdev mode".
> [HQN] Dear Jakub, there are several reasons that devlink shared buffer 
> ABI cannot be used:
> 1. The devlink shared buffer ABI is written based on the switch cli 
> which you can find out more
> from this link https://community.mellanox.com/docs/DOC-2558.

Devlink API accommodates requirements of simpler (SwitchX2?) and more
advanced schemes (present in Spectrum).  The simpler/basic static
threshold configuration is exactly what you are doing here, AFAIU.

> 2. The dcbnl interfaces have been used for QoS settings.

QoS settings != shared buffer configuration.

> In NIC, the  buffer configuration are tied to priority (ETS PFC).

Some customers use DCB, a lot (most?) of them don't.  I don't think the
"this is a logical extension of a commonly used API" really stands here.

> The buffer configuration are not tied to port like switch.

It's tied to a port and TCs, you just have one port but still have 8
TCs exactly like a switch...

> 3. Shared buffer, alpha, threshold are switch specific terms.

IDK how talking about alpha is relevant, it's just one threshold type
the API supports.  As far as shared buffer and threshold I don't know
if these are switch terms (or how "switch" differs from "NIC" at that
level) - I personally find carving shared buffer into pools very
intuitive.

Could you give examples of commands/configs one can use with your new
ABI?  How does one query the total size of the buffer to be carved?
Huy Nguyen May 23, 2018, 1:01 a.m. UTC | #4
Dear Jakub, PSB.

On 5/22/2018 1:32 PM, Jakub Kicinski wrote:
> On Tue, 22 May 2018 10:36:17 -0500, Huy Nguyen wrote:
>> On 5/22/2018 12:20 AM, Jakub Kicinski wrote:
>>> On a cursory look this bears a lot of resemblance to devlink shared
>>> buffer configuration ABI.  Did you look into using that?
>>>
>>> Just to be clear devlink shared buffer ABIs don't require representors
>>> and "switchdev mode".
>> [HQN] Dear Jakub, there are several reasons that devlink shared buffer
>> ABI cannot be used:
>> 1. The devlink shared buffer ABI is written based on the switch cli
>> which you can find out more
>> from this link https://community.mellanox.com/docs/DOC-2558.
> Devlink API accommodates requirements of simpler (SwitchX2?) and more
> advanced schemes (present in Spectrum).  The simpler/basic static
> threshold configurations is exactly what you are doing here, AFAIU.
[HQN] The devlink API is tailored specifically for switches. We don't
configure thresholds explicitly; that is done via PFC. Once PFC is
enabled on a priority, the threshold is set up based on our proprietary
formula, which was tested rigorously for performance.
>> 2. The dcbnl interfaces have been used for QoS settings.
> QoS settings != shared buffer configuration.
[HQN] I think we have different definitions of "shared buffer". Please
refer to the switch CLI link below, which explains in detail what
"shared buffer" means in a switch. Our NIC does not support a
"shared buffer".
https://community.mellanox.com/docs/DOC-2591

>
>> In NIC, the  buffer configuration are tied to priority (ETS PFC).
> Some customers use DCB, a lot (most?) of them don't.  I don't think the
> "this is a logical extension of a commonly used API" really stands here.
[HQN] DCBNL is being actively used. The whole point of this patch
is to tie the buffer configuration to IEEE priorities and the IEEE PFC
configuration.

A more ambitious future goal is to have the switch configure the NIC's
buffer size and buffer mapping via a TLV packet and this DCBNL interface.
But we won't go that far here.
>
>> The buffer configuration are not tied to port like switch.
> It's tied to a port and TCs, you just have one port but still have 8
> TCs exactly like a switch...
[HQN] No. Our buffers are tied to priorities, not to TCs.
>> 3. Shared buffer, alpha, threshold are switch specific terms.
> IDK how talking about alpha is relevant, it's just one threshold type
> the API supports.  As far as shared buffer and threshold I don't know
> if these are switch terms (or how "switch" differs from "NIC" at that
> level) - I personally find carving shared buffer into pools very
> intuitive.
[HQN] Yes, I understand your point too. The NIC's buffer shares some
characteristics with the switch's buffer settings. But this DCB buffer
setting is meant to improve performance and to work together with the
PFC setting. We would like to keep all the QoS settings under DCB Netlink,
as they were designed to be.

>
> Could you give examples of commands/configs one can use with your new
> ABI?
[HQN] The plan is to add support in lldptool once the kernel code is
accepted. To test the kernel code, I am using small Python scripts that
work on top of the netlink library. The format will be similar to other
lldptool options:
     priority2buffer: 0,2,5,7,1,2,3,6
         maps priorities 0,1,2,3,4,5,6,7 to buffers 0,2,5,7,1,2,3,6
     buffer_size: 87296,87296,0,87296,0,0,0,0
         sets the receive buffer size for buffers 0,1,2,3,4,5,6,7 respectively
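
To make the two option strings concrete, here is a minimal user-space sketch of how such values might be parsed and sanity-checked before being handed to a netlink library. It is not the test script mentioned in this reply: the function names, the limit of eight buffers, and the optional `total_size` check are assumptions made for illustration only.

```python
MAX_PRIORITIES = 8  # IEEE 802.1Q priorities 0..7
MAX_BUFFERS = 8     # buffers 0..7, as in the example strings (assumed limit)

def parse_list(text, count):
    """Parse a comma-separated list of exactly `count` integers."""
    values = [int(v) for v in text.split(",")]
    if len(values) != count:
        raise ValueError(f"expected {count} values, got {len(values)}")
    return values

def parse_buffer_config(prio2buffer, buffer_size, total_size=None):
    """Return (mapping, sizes); mapping[p] is the buffer used by priority p."""
    mapping = parse_list(prio2buffer, MAX_PRIORITIES)
    sizes = parse_list(buffer_size, MAX_BUFFERS)
    if any(not 0 <= b < MAX_BUFFERS for b in mapping):
        raise ValueError("buffer index out of range")
    if any(s < 0 for s in sizes):
        raise ValueError("negative buffer size")
    # Hypothetical early check; in the thread the kernel reports this error.
    if total_size is not None and sum(sizes) > total_size:
        raise ValueError("requested sizes exceed total buffer size")
    return mapping, sizes

mapping, sizes = parse_buffer_config(
    "0,2,5,7,1,2,3,6", "87296,87296,0,87296,0,0,0,0")
```

With the example strings, priority 1 lands on buffer 2, and the three non-zero buffers together request 3 x 87296 bytes.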
>    How does one query the total size of the buffer to be carved?
[HQN] This is not necessary. If the total size is too big, an error will be
returned via the DCB netlink interface.
>
Or Gerlitz May 23, 2018, 6:15 a.m. UTC | #5
On Wed, May 23, 2018 at 4:01 AM, Huy Nguyen <huyn@mellanox.com> wrote:
> Dear Jakub, PSB.
> On 5/22/2018 1:32 PM, Jakub Kicinski wrote:

>> Devlink API accommodates requirements of simpler (SwitchX2?) and more
>> advanced schemes (present in Spectrum).  The simpler/basic static
>> threshold configurations is exactly what you are doing here, AFAIU.

> [HQN] Devlink API is tailored specifically for switch. We don't configure
> threshold configuration
> explicitly. It is done via PFC. Once PFC is enabled on priority, threshold
> is setup based on our
> proprietary formula that were tested rigorously for performance.

Huy, please do not prefix your reply lines with your name; it's not needed
and is confusing. The email clients used by people on this list do the job.
Jakub Kicinski May 23, 2018, 9:23 a.m. UTC | #6
On Tue, 22 May 2018 20:01:21 -0500, Huy Nguyen wrote:
> On 5/22/2018 1:32 PM, Jakub Kicinski wrote:
> > On Tue, 22 May 2018 10:36:17 -0500, Huy Nguyen wrote:  
> >> On 5/22/2018 12:20 AM, Jakub Kicinski wrote:  
> >>> On a cursory look this bears a lot of resemblance to devlink shared
> >>> buffer configuration ABI.  Did you look into using that?
> >>>
> >>> Just to be clear devlink shared buffer ABIs don't require representors
> >>> and "switchdev mode".
> >> [HQN] Dear Jakub, there are several reasons that devlink shared buffer
> >> ABI cannot be used:
> >> 1. The devlink shared buffer ABI is written based on the switch cli
> >> which you can find out more
> >> from this link https://community.mellanox.com/docs/DOC-2558.  
> > Devlink API accommodates requirements of simpler (SwitchX2?) and more
> > advanced schemes (present in Spectrum).  The simpler/basic static
> > threshold configurations is exactly what you are doing here, AFAIU.  
> [HQN] Devlink API is tailored specifically for switch.

I hope that is not true, since we (Netronome) are trying to use it for
NIC configuration, too.  We should generalize the API if need be.

> We don't configure threshold configuration explicitly. It is done via
> PFC. Once PFC is enabled on priority, threshold is setup based on our
> proprietary formula that were tested rigorously for performance.

Are you referring to XOFF/XON thresholds?  I don't think the "threshold
type" in devlink API implies we are setting XON/XOFF thresholds
directly :S  If PFC is enabled we may be setting them indirectly,
obviously.

My understanding is that for the static threshold type the size parameter
specifies the max amount of memory a given pool can consume.

> >> 2. The dcbnl interfaces have been used for QoS settings.  
> > QoS settings != shared buffer configuration.  
> [HQN] I think we have different definition about "shared buffer".
> Please refer to this below switch cli link.
> It explained in detail what is the "shared buffer" in switch means.
> Our NIC does not have "shared buffer" supported.
> https://community.mellanox.com/docs/DOC-2591

Yes, we must have different definitions of "shared buffer" :)  That
link, however, didn't clarify much for me...  In mlx5 you seem to have a
buffer which is shared between priorities, even if it's not what would
be referred to as shared buffer in switch context.

> >> In NIC, the  buffer configuration are tied to priority (ETS PFC).  
> > Some customers use DCB, a lot (most?) of them don't.  I don't think
> > the "this is a logical extension of a commonly used API" really
> > stands here.  
> [HQN] DCBNL are being actively used. The whole point of this patch
> is to tie buffer configuration with IEEE's priority and is IEEE's PFC 
> configuration.
>
> Ambitious future is to have the switch configure the NIC's buffer
> size and buffer mapping
> via TLV packet and this DCBNL interface. But we won't go too far here.

I think I can understand the motivation, and I think it's a nice thing
to expose!  The only questions are: does it really belong to DCBNL and
can existing API be used?
 
From patch description it seems like your default setup is shared
buffer split 50% (lossy)/50% (all prios) and the example you give
changes that to 25% (lossy)/25%x3 prio groups.

With existing devlink API could this be modelled by three ingress pools
with 2 TCs bound each?
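
As a rough picture of what that modelling could look like, the sketch below treats each ingress pool as a fixed-size (static threshold) pool with two TCs bound to it. It is a toy model, not the devlink API: the class, the `TOTAL` value, and the assumption that TC n carries priority n are all hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class StaticPool:
    size: int                                # max bytes the pool may hold
    used: int = 0                            # bytes currently held
    tcs: set = field(default_factory=set)    # TCs bound to this pool

    def admit(self, nbytes):
        """Accept nbytes only if the pool stays within its static size."""
        if self.used + nbytes > self.size:
            return False
        self.used += nbytes
        return True

TOTAL = 400_000  # hypothetical total buffer size in bytes

# Three ingress pools of 25% each; the remaining 25% stays lossy.
pools = [StaticPool(size=TOTAL // 4) for _ in range(3)]
for tc in range(1, 7):
    pools[(tc - 1) // 2].tcs.add(tc)  # TCs 1-2, 3-4, 5-6
lossy_size = TOTAL - sum(pool.size for pool in pools)
```

Each pool refuses traffic once its own share is full, which is the isolation between message-size groups that the patch description relies on.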

> >> The buffer configuration are not tied to port like switch.  
> > It's tied to a port and TCs, you just have one port but still have 8
> > TCs exactly like a switch...  
> [HQN] No. Our buffer ties to priority not to TCs.

Right, that is a valid point.  Although TCs can be mapped to
priorities.  Some switches may tie buffers to priorities, too.  So
perhaps it's worth extending devlink?

> >> 3. Shared buffer, alpha, threshold are switch specific terms.  
> > IDK how talking about alpha is relevant, it's just one threshold
> > type the API supports.  As far as shared buffer and threshold I
> > don't know if these are switch terms (or how "switch" differs from
> > "NIC" at that level) - I personally find carving shared buffer into
> > pools very intuitive.  
> [HQN] Yes, I understand your point too. The NIC's buffer shares some 
> characteristics with the switch's buffer settings. 

Yes, and if it's not a perfect match we can extend it.

> But this DCB buffer setting is to improve the performance and work
> together with the PFC setting. We would like to keep all the qos
> setting under DCB Netlink as they are designed to be this way.

DCBNL seems to carry standard-based information, which this is not.
mlxsw supports DCBNL, will it also support this buffer configuration
mechanism?

> > Could you give examples of commands/configs one can use with your
> > new ABI?  
> [HQN] The plan is to add the support in lldptool once the kernel code
> is accepted. To test the kernel code,
> I am using small python scripts that works on top of the netlink
> library. It will be like this format which is similar to other
> options in lldptool priority2buffer: 0,2,5,7,1,2,3,6 maps priorities
> 0,1,2,3,4,5,6,7 to buffer 0,2,5,7,1,2,3,6
>      buffer_size: 87296,87296,0,87296,0,0,0,0 set receive buffer size 
> for buffer 0,1,2,3,4,5,6,7 respectively
> >    How does one query the total size of the buffer to be carved?  
> [HQN] This is not necessary. If the total size is too big, error will
> be return via DCB netlink interface.

Right, I'm not saying it's a bug :)  It's just nice when user can be
told the total size without having to probe for it :)
Jiri Pirko May 23, 2018, 9:33 a.m. UTC | #7
Wed, May 23, 2018 at 11:23:14AM CEST, jakub.kicinski@netronome.com wrote:
>On Tue, 22 May 2018 20:01:21 -0500, Huy Nguyen wrote:
>> On 5/22/2018 1:32 PM, Jakub Kicinski wrote:
>> > On Tue, 22 May 2018 10:36:17 -0500, Huy Nguyen wrote:  
>> >> On 5/22/2018 12:20 AM, Jakub Kicinski wrote:  
>> >>> On a cursory look this bears a lot of resemblance to devlink shared
>> >>> buffer configuration ABI.  Did you look into using that?
>> >>>
>> >>> Just to be clear devlink shared buffer ABIs don't require representors
>> >>> and "switchdev mode".
>> >> [HQN] Dear Jakub, there are several reasons that devlink shared buffer
>> >> ABI cannot be used:
>> >> 1. The devlink shared buffer ABI is written based on the switch cli
>> >> which you can find out more
>> >> from this link https://community.mellanox.com/docs/DOC-2558.  
>> > Devlink API accommodates requirements of simpler (SwitchX2?) and more
>> > advanced schemes (present in Spectrum).  The simpler/basic static
>> > threshold configurations is exactly what you are doing here, AFAIU.  
>> [HQN] Devlink API is tailored specifically for switch.
>
>I hope that is not true, since we (Netronome) are trying to use it for
>NIC configuration, too.  We should generalize the API if need be.

Sure it is not true. I have no clue why anyone thinks so :/


>
>> We don't configure threshold configuration explicitly. It is done via
>> PFC. Once PFC is enabled on priority, threshold is setup based on our
>> proprietary formula that were tested rigorously for performance.
>
>Are you referring to XOFF/XON thresholds?  I don't think the "threshold
>type" in devlink API implies we are setting XON/XOFF thresholds
>directly :S  If PFC is enabled we may be setting them indirectly,
>obviously.
>
>My understanding is that for static threshold type the size parameter
>specifies the max amount of memory given pool can consume.
>
>> >> 2. The dcbnl interfaces have been used for QoS settings.  
>> > QoS settings != shared buffer configuration.  
>> [HQN] I think we have different definition about "shared buffer".
>> Please refer to this below switch cli link.
>> It explained in detail what is the "shared buffer" in switch means.
>> Our NIC does not have "shared buffer" supported.
>> https://community.mellanox.com/docs/DOC-2591
>
>Yes, we must have a different definitions of "shared buffer" :)  That
>link, however, didn't clarify much for me...  In mlx5 you seem to have a
>buffer which is shared between priorities, even if it's not what would
>be referred to as shared buffer in switch context.

We introduced "shared buffer" in devlink with a "devlink handle" because
the buffer is shared among the whole ASIC, between multiple
ports/netdevs.


>
>> >> In the NIC, the buffer configuration is tied to priority (ETS PFC).
>> > Some customers use DCB, a lot (most?) of them don't.  I don't think
>> > the "this is a logical extension of a commonly used API" really
>> > stands here.  
>> [HQN] DCBNL is being actively used. The whole point of this patch
>> is to tie buffer configuration to IEEE priorities and IEEE PFC
>> configuration.
>>
>> The ambitious future goal is to have the switch configure the NIC's
>> buffer size and buffer mapping
>> via TLV packet and this DCBNL interface. But we won't go too far here.
>
>I think I can understand the motivation, and I think it's a nice thing
>to expose!  The only questions are: does it really belong to DCBNL and
>can existing API be used?
> 
>From patch description it seems like your default setup is shared
>buffer split 50% (lossy)/50% (all prios) and the example you give
>changes that to 25% (lossy)/25%x3 prio groups.
>
>With existing devlink API could this be modelled by three ingress pools
>with 2 TCs bound each?
>
>> >> The buffer configuration is not tied to a port as it is in a switch.
>> > It's tied to a port and TCs, you just have one port but still have 8
>> > TCs exactly like a switch...  
>> [HQN] No. Our buffer is tied to priority, not to TCs.
>
>Right, that is a valid point.  Although TCs can be mapped to
>priorities.  Some switches may tie buffers to priorities, too.  So
>perhaps it's worth extending devlink?
>
>> >> 3. Shared buffer, alpha, threshold are switch specific terms.  
>> > IDK how talking about alpha is relevant, it's just one threshold
>> > type the API supports.  As far as shared buffer and threshold I
>> > don't know if these are switch terms (or how "switch" differs from
>> > "NIC" at that level) - I personally find carving shared buffer into
>> > pools very intuitive.  
>> [HQN] Yes, I understand your point too. The NIC's buffer shares some 
>> characteristics with the switch's buffer settings. 
>
>Yes, and if it's not a perfect match we can extend it.
>
>> But this DCB buffer setting is to improve the performance and work
>> together with the PFC setting. We would like to keep all the qos
>> setting under DCB Netlink as they are designed to be this way.
>
>DCBNL seems to carry standard-based information, which this is not.
>mlxsw supports DCBNL, will it also support this buffer configuration
>mechanism?

Ido can provide you with more accurate info. Basically, in mlxsw we use
dcbnl for the things it can cover and was designed for. And for those
things, the netdev is a handle. Config is specific to the netdev. On the
other hand, devlink shared buffer is used for buffers shared between all
netdevs.



>
>> > Could you give examples of commands/configs one can use with your
>> > new ABI?  
>> [HQN] The plan is to add the support in lldptool once the kernel code
>> is accepted. To test the kernel code,
>> I am using small python scripts that work on top of the netlink
>> library. It will be in this format, which is similar to other
>> options in lldptool:
>>      priority2buffer: 0,2,5,7,1,2,3,6 maps priorities
>> 0,1,2,3,4,5,6,7 to buffers 0,2,5,7,1,2,3,6
>>      buffer_size: 87296,87296,0,87296,0,0,0,0 sets the receive buffer
>> size for buffers 0,1,2,3,4,5,6,7 respectively
>> >    How does one query the total size of the buffer to be carved?  
>> [HQN] This is not necessary. If the total size is too big, an error will
>> be returned via the DCB netlink interface.
>
>Right, I'm not saying it's a bug :)  It's just nice when user can be
>told the total size without having to probe for it :)
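
The lldptool-style example quoted above can be sketched as a fill of the
proposed struct dcbnl_buffer. This is only an illustration under stated
assumptions: the struct is copied from the patch with userspace fixed-width
types in place of __u8/__u32, and the helpers fill_example_buffer and
check_total are hypothetical names that do not appear in the patch or in any
driver. The total-size check only mirrors the "error if too big" behaviour
described above; the real limit and error code are up to the driver.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>

#define IEEE_8021Q_MAX_PRIORITIES 8
#define DCBX_MAX_BUFFERS 8

/* Userspace mirror of the struct proposed in the patch. */
struct dcbnl_buffer {
	/* priority to buffer mapping */
	uint8_t  prio2buffer[IEEE_8021Q_MAX_PRIORITIES];
	/* buffer size in bytes */
	uint32_t buffer_size[DCBX_MAX_BUFFERS];
};

/* Hypothetical helper: load the values from the quoted example. */
static void fill_example_buffer(struct dcbnl_buffer *b)
{
	static const uint8_t map[IEEE_8021Q_MAX_PRIORITIES] =
		{0, 2, 5, 7, 1, 2, 3, 6};
	static const uint32_t size[DCBX_MAX_BUFFERS] =
		{87296, 87296, 0, 87296, 0, 0, 0, 0};

	assert(b != NULL);
	memcpy(b->prio2buffer, map, sizeof(map));
	memcpy(b->buffer_size, size, sizeof(size));
}

/*
 * Hypothetical check: reject a request whose buffer sizes add up to more
 * than the space available. -EINVAL is an assumed error code.
 */
static int check_total(const struct dcbnl_buffer *b, uint64_t total_available)
{
	uint64_t sum = 0;
	int i;

	for (i = 0; i < DCBX_MAX_BUFFERS; i++)
		sum += b->buffer_size[i];

	return sum > total_available ? -EINVAL : 0;
}
```

With the example values the three non-zero buffers add up to 3 * 87296 bytes,
so any total below that would be rejected by this check.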
Jiri Pirko May 23, 2018, 9:43 a.m. UTC | #8
Tue, May 22, 2018 at 07:20:26AM CEST, jakub.kicinski@netronome.com wrote:
>On Mon, 21 May 2018 14:04:57 -0700, Saeed Mahameed wrote:
>> From: Huy Nguyen <huyn@mellanox.com>
>> 
>> In this patch, we add a dcbnl buffer attribute to allow the user to
>> change the NIC's buffer configuration, such as the priority
>> to buffer mapping and the size of each individual buffer.
>>
>> This attribute combined with the pfc attribute allows advanced users to
>> fine tune the qos setting for a specific priority queue. For example,
>> the user can give a dedicated buffer to one or more priorities or
>> give a large buffer to certain priorities.
>>
>> We present a use case scenario where the dcbnl buffer attribute configured
>> by an advanced user helps reduce the latency of messages of different sizes.
>>
>> Scenario description:
>> On ConnectX-5, we run latency sensitive traffic with
>> small/medium message sizes ranging from 64B to 256KB and bandwidth sensitive
>> traffic with large message sizes of 512KB and 1MB. We group small, medium,
>> and large message sizes into their own pfc enabled priorities as follows.
>>   Priorities 1 & 2 (64B, 256B and 1KB)
>>   Priorities 3 & 4 (4KB, 8KB, 16KB, 64KB, 128KB and 256KB)
>>   Priorities 5 & 6 (512KB and 1MB)
>>
>> By default, ConnectX-5 maps all pfc enabled priorities to a single
>> lossless buffer with a fixed size of 50% of total available buffer space.
>> The other 50% is assigned to the lossy buffer. Using the dcbnl buffer
>> attribute, we create three equal size lossless buffers. Each buffer has 25%
>> of total available buffer space. Thus, the lossy buffer size reduces to 25%.
>> Priority to lossless buffer mappings are set as follows.
>>   Priorities 1 & 2 on lossless buffer #1
>>   Priorities 3 & 4 on lossless buffer #2
>>   Priorities 5 & 6 on lossless buffer #3
>>
>> We observe improvements in latency for small and medium message sizes
>> as follows. Please note that the bandwidth for large message sizes is
>> reduced but the total bandwidth remains the same.
>>   256B message size (42% latency reduction)
>>   4K message size (21% latency reduction)
>>   64K message size (16% latency reduction)
>> 
>> Signed-off-by: Huy Nguyen <huyn@mellanox.com>
>> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
>
>On a cursory look this bears a lot of resemblance to devlink shared
>buffer configuration ABI.  Did you look into using that?  
>
>Just to be clear devlink shared buffer ABIs don't require representors
>and "switchdev mode".

If the CX5 buffer they are trying to utilize here is per port and not a
shared one, it would seem ok for me to not have it in "devlink sb".
John Fastabend May 23, 2018, 1:52 p.m. UTC | #9
On 05/23/2018 02:43 AM, Jiri Pirko wrote:
> Tue, May 22, 2018 at 07:20:26AM CEST, jakub.kicinski@netronome.com wrote:
>> On Mon, 21 May 2018 14:04:57 -0700, Saeed Mahameed wrote:
>>> [...]
>>
>> On a cursory look this bears a lot of resemblance to devlink shared
>> buffer configuration ABI.  Did you look into using that?  
>>
>> Just to be clear devlink shared buffer ABIs don't require representors
>> and "switchdev mode".
> 
> If the CX5 buffer they are trying to utilize here is per port and not a
> shared one, it would seem ok for me to not have it in "devlink sb".
> 

+1 I think it's probably reasonable to let devlink manage the global
(device layer) buffers and then have dcbnl partition the buffer up
further per netdev. Notice there is already a partitioning of the
buffers happening when DCB is enabled and/or parameters are changed.
So giving explicit control over this seems OK to me.

It would be nice though if the API gave us some hint on max/min/stride
of allowed values. Could the get API return these along with current
value? Presumably the allowed max size could change with devlink buffer
changes in how the global buffer is divided up as well.

The argument against allowing this API is it doesn't have anything to
do with the 802.1Q standard, but that is fine IMO.

.John
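
The max/min/stride hint requested above could look roughly like the sketch
below. Everything in it is an assumption made for illustration: the struct
buffer_limits, its field names, the helper buffer_size_valid, the choice of
error codes, and the rule that a size of zero means "buffer unused" are not
part of the patch or of the existing dcbnl API.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical limits a get call could report alongside current values. */
struct buffer_limits {
	uint32_t min;    /* smallest non-zero size accepted, in bytes */
	uint32_t max;    /* largest size accepted, in bytes */
	uint32_t stride; /* granularity of accepted sizes, in bytes */
};

/*
 * Hypothetical validation of one requested buffer size against the limits.
 * Returns 0 when the size is acceptable and a negative errno otherwise.
 */
static int buffer_size_valid(const struct buffer_limits *lim, uint32_t size)
{
	assert(lim != NULL);

	if (lim->stride == 0)
		return -EINVAL;	/* malformed limits */
	if (size == 0)
		return 0;	/* assumption: zero disables the buffer */
	if (size < lim->min || size > lim->max)
		return -ERANGE;
	if (size % lim->stride)
		return -EINVAL;	/* falls between strides */

	return 0;
}
```

A userspace tool could run such a check before sending a set request, instead
of probing the driver with trial values.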
Huy Nguyen May 23, 2018, 3:08 p.m. UTC | #10
> I hope that is not true, since we (Netronome) are trying to use it for
> NIC configuration, too.  We should generalize the API if need be.
Yes, it is up to your company. devlink is a static tool. DCBNL is intended to
be dynamically configured by the switch. In the real world, not many people
configure the NIC's QoS from the host.
Huy Nguyen May 23, 2018, 3:27 p.m. UTC | #11
On 5/23/2018 4:23 AM, Jakub Kicinski wrote:
> From patch description it seems like your default setup is shared
> buffer split 50% (lossy)/50% (all prios) and the example you give
> changes that to 25% (lossy)/25%x3 prio groups.
>
> With existing devlink API could this be modelled by three ingress pools
> with 2 TCs bound each?
Yes, that is possible when you map prio to TC. Please be careful with the term
"prio" in a switch context, since a switch has more than 8 priorities.
Huy Nguyen May 23, 2018, 3:37 p.m. UTC | #12
On 5/23/2018 8:52 AM, John Fastabend wrote:
> It would be nice though if the API gave us some hint on max/min/stride
> of allowed values. Could the get API return these along with current
> value? Presumably the allowed max size could change with devlink buffer
> changes in how the global buffer is divided up as well.
Acked. I will add Max. Let's skip min/stride since it is too hardware
specific.
John Fastabend May 23, 2018, 4:03 p.m. UTC | #13
On 05/23/2018 08:37 AM, Huy Nguyen wrote:
> 
> 
> On 5/23/2018 8:52 AM, John Fastabend wrote:
>> It would be nice though if the API gave us some hint on max/min/stride
>> of allowed values. Could the get API return these along with current
>> value? Presumably the allowed max size could change with devlink buffer
>> changes in how the global buffer is divided up as well.
> Acked. I will add Max. Let's skip min/stride since it is too hardware specific.

At minimum then we need to document for driver writers what to do
with a value that falls between strides. Round-up or round-down.

.John
Jakub Kicinski May 23, 2018, 8:13 p.m. UTC | #14
On Wed, 23 May 2018 06:52:33 -0700, John Fastabend wrote:
> On 05/23/2018 02:43 AM, Jiri Pirko wrote:
> > Tue, May 22, 2018 at 07:20:26AM CEST, jakub.kicinski@netronome.com wrote:  
> >> On Mon, 21 May 2018 14:04:57 -0700, Saeed Mahameed wrote:  
> >>> [...]
> >>
> >> On a cursory look this bears a lot of resemblance to devlink shared
> >> buffer configuration ABI.  Did you look into using that?  
> >>
> >> Just to be clear devlink shared buffer ABIs don't require representors
> >> and "switchdev mode".  
> > 
> > If the CX5 buffer they are trying to utilize here is per port and not a
> > shared one, it would seem ok for me to not have it in "devlink sb".

What I meant is that it may be shared between VFs and PF contexts.  But
if it's purely ingress per-prio FIFO without any advanced configuration
capabilities, then perhaps this API is a better match.

> +1 I think it's probably reasonable to let devlink manage the global
> (device layer) buffers and then have dcbnl partition the buffer up
> further per netdev. Notice there is already a partitioning of the
> buffers happening when DCB is enabled and/or parameters are changed.
> So giving explicit control over this seems OK to me.

Okay, thanks for the discussion! :)

> It would be nice though if the API gave us some hint on max/min/stride
> of allowed values. Could the get API return these along with current
> value? Presumably the allowed max size could change with devlink
> buffer changes in how the global buffer is divided up as well.
> 
> The argument against allowing this API is it doesn't have anything to
> do with the 802.1Q standard, but that is fine IMO.
Jakub Kicinski May 23, 2018, 8:19 p.m. UTC | #15
On Mon, 21 May 2018 14:04:57 -0700, Saeed Mahameed wrote:
> diff --git a/include/uapi/linux/dcbnl.h b/include/uapi/linux/dcbnl.h
> index 2c0c6453c3f4..1ddc0a44c172 100644
> --- a/include/uapi/linux/dcbnl.h
> +++ b/include/uapi/linux/dcbnl.h
> @@ -163,6 +163,15 @@ struct ieee_pfc {
>  	__u64	indications[IEEE_8021QAZ_MAX_TCS];
>  };
>  
> +#define IEEE_8021Q_MAX_PRIORITIES 8
> +#define DCBX_MAX_BUFFERS  8
> +struct dcbnl_buffer {
> +	/* priority to buffer mapping */
> +	__u8    prio2buffer[IEEE_8021Q_MAX_PRIORITIES];
> +	/* buffer size in Bytes */
> +	__u32   buffer_size[DCBX_MAX_BUFFERS];

Could you use IEEE_8021Q_MAX_PRIORITIES to size this array?  The DCBX in
the define name sort of implies this is coming from the standard which
it isn't.

> +};
> +
>  /* CEE DCBX std supported values */
>  #define CEE_DCBX_MAX_PGS	8
>  #define CEE_DCBX_MAX_PRIO	8
Jakub Kicinski May 23, 2018, 8:28 p.m. UTC | #16
On Wed, 23 May 2018 09:03:53 -0700, John Fastabend wrote:
> On 05/23/2018 08:37 AM, Huy Nguyen wrote:
> > 
> > 
> > On 5/23/2018 8:52 AM, John Fastabend wrote:  
> >> It would be nice though if the API gave us some hint on max/min/stride
> >> of allowed values. Could the get API return these along with current
> >> value? Presumably the allowed max size could change with devlink buffer
> >> changes in how the global buffer is divided up as well.  
> > Acked. I will add Max. Let's skip min/stride since it is too hardware specific.  
> 
> At minimum then we need to document for driver writers what to do
> with a value that falls between strides. Round-up or round-down.

BTW I feel like stride would be a good addition to devlink-sb, too!
Huy Nguyen May 24, 2018, 2:11 p.m. UTC | #17
On 5/23/2018 3:19 PM, Jakub Kicinski wrote:
> On Mon, 21 May 2018 14:04:57 -0700, Saeed Mahameed wrote:
>> diff --git a/include/uapi/linux/dcbnl.h b/include/uapi/linux/dcbnl.h
>> index 2c0c6453c3f4..1ddc0a44c172 100644
>> --- a/include/uapi/linux/dcbnl.h
>> +++ b/include/uapi/linux/dcbnl.h
>> @@ -163,6 +163,15 @@ struct ieee_pfc {
>>   	__u64	indications[IEEE_8021QAZ_MAX_TCS];
>>   };
>>   
>> +#define IEEE_8021Q_MAX_PRIORITIES 8
>> +#define DCBX_MAX_BUFFERS  8
>> +struct dcbnl_buffer {
>> +	/* priority to buffer mapping */
>> +	__u8    prio2buffer[IEEE_8021Q_MAX_PRIORITIES];
>> +	/* buffer size in Bytes */
>> +	__u32   buffer_size[DCBX_MAX_BUFFERS];
> Could you use IEEE_8021Q_MAX_PRIORITIES to size this array?  The DCBX in
> the define name sort of implies this is coming from the standard which
> it isn't.
>
I agree with your comment about the standard. But since priority is mapped to
buffer, I think it is okay to reuse the #define. Let's not have a duplicate
#define with the same meaning.
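
A minimal sketch of the sizing suggestion quoted above, assuming both arrays
are sized with IEEE_8021Q_MAX_PRIORITIES. The struct name dcbnl_buffer_alt and
the helper prio2buffer_in_range are hypothetical and only illustrate the idea;
the posted patch uses DCBX_MAX_BUFFERS for the second array, and userspace
fixed-width types stand in for __u8/__u32 here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define IEEE_8021Q_MAX_PRIORITIES 8

/* Hypothetical variant of the struct with a single sizing define. */
struct dcbnl_buffer_alt {
	/* priority to buffer mapping */
	uint8_t  prio2buffer[IEEE_8021Q_MAX_PRIORITIES];
	/* buffer size in bytes */
	uint32_t buffer_size[IEEE_8021Q_MAX_PRIORITIES];
};

/*
 * Hypothetical sanity check: every priority must map to a buffer index
 * that exists in buffer_size[]. Returns 1 if all entries are in range,
 * 0 otherwise.
 */
static int prio2buffer_in_range(const struct dcbnl_buffer_alt *b)
{
	int prio;

	assert(b != NULL);

	for (prio = 0; prio < IEEE_8021Q_MAX_PRIORITIES; prio++)
		if (b->prio2buffer[prio] >= IEEE_8021Q_MAX_PRIORITIES)
			return 0;

	return 1;
}
```

With one define the number of buffers is implicitly tied to the number of
priorities, which is the coupling the comment above relies on.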
Huy Nguyen May 24, 2018, 2:37 p.m. UTC | #18
On 5/23/2018 11:03 AM, John Fastabend wrote:
> On 05/23/2018 08:37 AM, Huy Nguyen wrote:
>>
>> On 5/23/2018 8:52 AM, John Fastabend wrote:
>>> It would be nice though if the API gave us some hint on max/min/stride
>>> of allowed values. Could the get API return these along with current
>>> value? Presumably the allowed max size could change with devlink buffer
>>> changes in how the global buffer is divided up as well.
>> Acked. I will add Max. Let's skip min/stride since it is too hardware specific.
> At minimum then we need to document for driver writers what to do
> with a value that falls between strides. Round-up or round-down.
>
> .John
V2 is still under internal review, but here are the changes in patch #1 and
patch #6.

patch #1
Changes in V2:
     Add total_size in dcbnl_buffer to report the total available buffer
     size of the netdev.
     Code changes are in patch #1 and #6.

patch #6 commit message
Changes in V2:
     Report total available buffer size of the netdev.
     Comment on buffer stride:
     Mellanox HCA buffer stride is 128 Bytes. If the
     buffer size is not a multiple of 128, the buffer size will be rounded
     down to the nearest multiple of 128.
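
The round-down rule from the V2 notes above can be sketched as follows. The
128-byte stride comes from those notes; the struct name dcbnl_buffer_v2, the
position and type of total_size, and the helper names are assumptions, since
the V2 code itself is not shown in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define IEEE_8021Q_MAX_PRIORITIES 8
#define DCBX_MAX_BUFFERS 8
#define EXAMPLE_STRIDE 128u	/* stride stated in the V2 notes */

/* Assumed V2 layout: total_size placement and type are guesses. */
struct dcbnl_buffer_v2 {
	uint8_t  prio2buffer[IEEE_8021Q_MAX_PRIORITIES];
	uint32_t buffer_size[DCBX_MAX_BUFFERS];
	uint32_t total_size;	/* total available buffer size, in bytes */
};

/* Round a requested size down to the nearest multiple of the stride. */
static uint32_t round_down_to_stride(uint32_t size)
{
	return size - (size % EXAMPLE_STRIDE);
}

/* Apply the round-down rule to every buffer in a request. */
static void round_buffer_sizes(struct dcbnl_buffer_v2 *b)
{
	int i;

	assert(b != NULL);

	for (i = 0; i < DCBX_MAX_BUFFERS; i++)
		b->buffer_size[i] = round_down_to_stride(b->buffer_size[i]);
}
```

Note that rounding down means a non-zero request smaller than one stride ends
up as zero, which is one reason the rounding direction is worth documenting.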
Ido Schimmel May 24, 2018, 5:13 p.m. UTC | #19
Hi Jakub,

On Wed, May 23, 2018 at 02:23:14AM -0700, Jakub Kicinski wrote:
> Are you referring to XOFF/XON thresholds?  I don't think the "threshold
> type" in devlink API implies we are setting XON/XOFF thresholds
> directly :S  If PFC is enabled we may be setting them indirectly,
> obviously.
> 
> My understanding is that for static threshold type the size parameter
> specifies the max amount of memory given pool can consume.

Correct.

> Yes, we must have a different definitions of "shared buffer" :)  That
> link, however, didn't clarify much for me...  In mlx5 you seem to have a
> buffer which is shared between priorities, even if it's not what would
> be referred to as shared buffer in switch context.

The following link is my attempt at explaining the above concepts:
https://github.com/Mellanox/mlxsw/wiki/Quality-of-Service

Please let me know if something is not clear.

Basically, we use devlink-sb and dcbnl to configure two different
buffers:

* devlink-sb is used to configure the switch's shared buffer which is
shared between all the ports and thus can't take a netdev as a handle

* dcbnl is used to configure per-port buffers (also called headroom
buffers) where received packets are stored while going through the
switch's pipeline before being admitted to the shared buffer and
awaiting transmission

Note that in Huy's case the buffers are of the second type (per-port)
and thus using dcbnl instead of devlink-sb makes sense.

> DCBNL seems to carry standard-based information, which this is not.
> mlxsw supports DCBNL, will it also support this buffer configuration
> mechanism?

I believe so, it's just a matter of doing the work. The hardware
supports this and the interface is identical to the NIC (same
registers).

> > >    How does one query the total size of the buffer to be carved?  
> > [HQN] This is not necessary. If the total size is too big, an error will
> > be returned via the DCB netlink interface.
> 
> Right, I'm not saying it's a bug :)  It's just nice when user can be
> told the total size without having to probe for it :)

+1
Patch

diff --git a/include/net/dcbnl.h b/include/net/dcbnl.h
index 207d9ba1f92c..0e5e91be2d30 100644
--- a/include/net/dcbnl.h
+++ b/include/net/dcbnl.h
@@ -101,6 +101,10 @@  struct dcbnl_rtnl_ops {
 	/* CEE peer */
 	int (*cee_peer_getpg) (struct net_device *, struct cee_pg *);
 	int (*cee_peer_getpfc) (struct net_device *, struct cee_pfc *);
+
+	/* buffer settings */
+	int (*dcbnl_getbuffer)(struct net_device *, struct dcbnl_buffer *);
+	int (*dcbnl_setbuffer)(struct net_device *, struct dcbnl_buffer *);
 };
 
 #endif /* __NET_DCBNL_H__ */
diff --git a/include/uapi/linux/dcbnl.h b/include/uapi/linux/dcbnl.h
index 2c0c6453c3f4..1ddc0a44c172 100644
--- a/include/uapi/linux/dcbnl.h
+++ b/include/uapi/linux/dcbnl.h
@@ -163,6 +163,15 @@  struct ieee_pfc {
 	__u64	indications[IEEE_8021QAZ_MAX_TCS];
 };
 
+#define IEEE_8021Q_MAX_PRIORITIES 8
+#define DCBX_MAX_BUFFERS  8
+struct dcbnl_buffer {
+	/* priority to buffer mapping */
+	__u8    prio2buffer[IEEE_8021Q_MAX_PRIORITIES];
+	/* buffer size in Bytes */
+	__u32   buffer_size[DCBX_MAX_BUFFERS];
+};
+
 /* CEE DCBX std supported values */
 #define CEE_DCBX_MAX_PGS	8
 #define CEE_DCBX_MAX_PRIO	8
@@ -406,6 +415,7 @@  enum ieee_attrs {
 	DCB_ATTR_IEEE_MAXRATE,
 	DCB_ATTR_IEEE_QCN,
 	DCB_ATTR_IEEE_QCN_STATS,
+	DCB_ATTR_DCB_BUFFER,
 	__DCB_ATTR_IEEE_MAX
 };
 #define DCB_ATTR_IEEE_MAX (__DCB_ATTR_IEEE_MAX - 1)
diff --git a/net/dcb/dcbnl.c b/net/dcb/dcbnl.c
index bae7d78aa068..d2f4e0c1faaf 100644
--- a/net/dcb/dcbnl.c
+++ b/net/dcb/dcbnl.c
@@ -176,6 +176,7 @@  static const struct nla_policy dcbnl_ieee_policy[DCB_ATTR_IEEE_MAX + 1] = {
 	[DCB_ATTR_IEEE_MAXRATE]   = {.len = sizeof(struct ieee_maxrate)},
 	[DCB_ATTR_IEEE_QCN]         = {.len = sizeof(struct ieee_qcn)},
 	[DCB_ATTR_IEEE_QCN_STATS]   = {.len = sizeof(struct ieee_qcn_stats)},
+	[DCB_ATTR_DCB_BUFFER]       = {.len = sizeof(struct dcbnl_buffer)},
 };
 
 /* DCB number of traffic classes nested attributes. */
@@ -1094,6 +1095,16 @@  static int dcbnl_ieee_fill(struct sk_buff *skb, struct net_device *netdev)
 			return -EMSGSIZE;
 	}
 
+	if (ops->dcbnl_getbuffer) {
+		struct dcbnl_buffer buffer;
+
+		memset(&buffer, 0, sizeof(buffer));
+		err = ops->dcbnl_getbuffer(netdev, &buffer);
+		if (!err &&
+		    nla_put(skb, DCB_ATTR_DCB_BUFFER, sizeof(buffer), &buffer))
+			return -EMSGSIZE;
+	}
+
 	app = nla_nest_start(skb, DCB_ATTR_IEEE_APP_TABLE);
 	if (!app)
 		return -EMSGSIZE;
@@ -1453,6 +1464,15 @@  static int dcbnl_ieee_set(struct net_device *netdev, struct nlmsghdr *nlh,
 			goto err;
 	}
 
+	if (ieee[DCB_ATTR_DCB_BUFFER] && ops->dcbnl_setbuffer) {
+		struct dcbnl_buffer *buffer =
+			nla_data(ieee[DCB_ATTR_DCB_BUFFER]);
+
+		err = ops->dcbnl_setbuffer(netdev, buffer);
+		if (err)
+			goto err;
+	}
+
 	if (ieee[DCB_ATTR_IEEE_APP_TABLE]) {
 		struct nlattr *attr;
 		int rem;