
[net-next,V2,08/15] net/mlx5e: Add TX PTP port object support

Message ID 20201203042108.232706-9-saeedm@nvidia.com
State Not Applicable
Series [net-next,V2,01/15] net/mlx5e: Free drop RQ in a dedicated function

Commit Message

Saeed Mahameed Dec. 3, 2020, 4:21 a.m. UTC
From: Eran Ben Elisha <eranbe@nvidia.com>

Add TX PTP port object support for better TX timestamping accuracy.
Currently, driver supports CQE based TX port timestamp. Device
also offers TX port timestamp, which has less jitter and better
reflects the actual time of a packet's transmit.

Define new driver layout called ptpsq, on which driver will create
SQs that will support TX port timestamp for their transmitted packets.
Driver to identify PTP TX skbs and steer them to these dedicated SQs
as part of the select queue ndo.

Driver to hold ptpsq per TC and report them at
netif_set_real_num_tx_queues().

Add support for all needed functionality in order to xmit and poll
completions received via ptpsq.

Add ptpsq to the TX reporter recover, diagnose and dump methods.

Creation of ptpsqs is disabled by default, and can be enabled via
tx_port_ts private flag.

This patch steers all timestamp related packets to a ptpsq, but it
does not open the port timestamp support for it. The support will
be added in the following patch.

Signed-off-by: Eran Ben Elisha <eranbe@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
---
 .../net/ethernet/mellanox/mlx5/core/Makefile  |   2 +-
 drivers/net/ethernet/mellanox/mlx5/core/en.h  |  29 +-
 .../ethernet/mellanox/mlx5/core/en/params.h   |   8 +
 .../net/ethernet/mellanox/mlx5/core/en/ptp.c  | 360 ++++++++++++++++++
 .../net/ethernet/mellanox/mlx5/core/en/ptp.h  |  48 +++
 .../mellanox/mlx5/core/en/reporter_tx.c       | 166 ++++++--
 .../ethernet/mellanox/mlx5/core/en_ethtool.c  |  33 ++
 .../net/ethernet/mellanox/mlx5/core/en_main.c |  82 ++--
 .../ethernet/mellanox/mlx5/core/en_stats.c    |  96 +++++
 .../ethernet/mellanox/mlx5/core/en_stats.h    |   3 +
 .../net/ethernet/mellanox/mlx5/core/en_tx.c   |  56 ++-
 11 files changed, 823 insertions(+), 60 deletions(-)
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en/ptp.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en/ptp.h

Comments

Jakub Kicinski Dec. 4, 2020, 2:29 a.m. UTC | #1
On Wed, 2 Dec 2020 20:21:01 -0800 Saeed Mahameed wrote:
> Add TX PTP port object support for better TX timestamping accuracy.
> Currently, driver supports CQE based TX port timestamp. Device
> also offers TX port timestamp, which has less jitter and better
> reflects the actual time of a packet's transmit.

How much better is it?

Is the new implementation standard compliant or just a "better
guess"?

> Define new driver layout called ptpsq, on which driver will create
> SQs that will support TX port timestamp for their transmitted packets.
> Driver to identify PTP TX skbs and steer them to these dedicated SQs
> as part of the select queue ndo.
> 
> Driver to hold ptpsq per TC and report them at
> netif_set_real_num_tx_queues().
> 
> Add support for all needed functionality in order to xmit and poll
> completions received via ptpsq.
> 
> Add ptpsq to the TX reporter recover, diagnose and dump methods.
> 
> Creation of ptpsqs is disabled by default, and can be enabled via
> tx_port_ts private flag.

This flag is pretty bad user experience.

> This patch steer all timestamp related packets to a ptpsq, but it
> does not open the port timestamp support for it. The support will
> be added in the following patch.

Overall I'm a little shocked by this, let me sleep on it :)

More info on the trade offs and considerations which led to the
implementation would be useful.
Saeed Mahameed Dec. 4, 2020, 7:33 p.m. UTC | #2
On Thu, 2020-12-03 at 18:29 -0800, Jakub Kicinski wrote:
> On Wed, 2 Dec 2020 20:21:01 -0800 Saeed Mahameed wrote:
> > Add TX PTP port object support for better TX timestamping accuracy.
> > Currently, driver supports CQE based TX port timestamp. Device
> > also offers TX port timestamp, which has less jitter and better
> > reflects the actual time of a packet's transmit.
> 
> How much better is it?
> 
> Is the new implementation is standard compliant or just a "better
> guess"?
> 

It is not a guess for sure: the closer to the output port you take the
stamp, the more accurate you get. This is why we need the HW timestamp
in the first place. I don't have the exact number, but we target
compliance with G.8273.2 class C (30 nsec), and this code allows
Linux systems to be deployed in the 5G telco edge, where this standard
is needed.

> > Define new driver layout called ptpsq, on which driver will create
> > SQs that will support TX port timestamp for their transmitted
> > packets.
> > Driver to identify PTP TX skbs and steer them to these dedicated
> > SQs
> > as part of the select queue ndo.
> > 
> > Driver to hold ptpsq per TC and report them at
> > netif_set_real_num_tx_queues().
> > 
> > Add support for all needed functionality in order to xmit and poll
> > completions received via ptpsq.
> > 
> > Add ptpsq to the TX reporter recover, diagnose and dump methods.
> > 
> > Creation of ptpsqs is disabled by default, and can be enabled via
> > tx_port_ts private flag.
> 
> This flag is pretty bad user experience.
> 

Yeah, nothing I could do about this. There is a large memory footprint
I want to avoid, and we don't want to complicate the PTP ctrl API of
the HW operating mode, so until we improve the HW, we prefer to keep
this feature as a private flag.

> > This patch steer all timestamp related packets to a ptpsq, but it
> > does not open the port timestamp support for it. The support will
> > be added in the following patch.
> 
> Overall I'm a little shocked by this, let me sleep on it :)
> 
> More info on the trade offs and considerations which led to the
> implementation would be useful.

To get the improved accuracy we need a special type of SQ attached to
special HW objects that provide more accurate stamping.

The trade-offs are:

Option 1) Convert ALL regular txqs (SQs) to work in this port stamping
mode.

Pros: no need for any special mode in the driver, no additional memory
other than the new HW objects we create for the special stamping.

Cons: significant performance hit for non-PTP traffic (the HW stamps
all packets in the slow but more accurate mode).

Option 2) Route PTP traffic to a special SQ per ring. This SQ will be
PTP port accurate; normal traffic will continue through regular SQs.

Pros: Regular non-PTP traffic is not affected.
Cons: High memory footprint for creating the special SQs.


So we prefer (2) + a private flag, to avoid the performance hit and the
redundant memory usage out of the box.
Jakub Kicinski Dec. 4, 2020, 8:26 p.m. UTC | #3
On Fri, 04 Dec 2020 11:33:26 -0800 Saeed Mahameed wrote:
> On Thu, 2020-12-03 at 18:29 -0800, Jakub Kicinski wrote:
> > On Wed, 2 Dec 2020 20:21:01 -0800 Saeed Mahameed wrote:  
> > > Add TX PTP port object support for better TX timestamping accuracy.
> > > Currently, driver supports CQE based TX port timestamp. Device
> > > also offers TX port timestamp, which has less jitter and better
> > > reflects the actual time of a packet's transmit.  
> > 
> > How much better is it?
> > 
> > Is the new implementation is standard compliant or just a "better
> > guess"?
> 
> It is not a guess for sure, the closer to the output port you take the
> stamp the more accurate you get, this is why we need the HW timestamp
> in first place, i don't have the exact number though, but we target to
> be compliant with G.8273.2 class C, (30 nsec), and this code allow
> Linux systems to be deployed in the 5G telco edge. Where this standard
> is needed.

I see. IIRC there was also an IEEE standard which specified the exact
time stamping point (i.e. SFD crosses layer X). If it's class C that
answers the question, I think.

> > > Define new driver layout called ptpsq, on which driver will create
> > > SQs that will support TX port timestamp for their transmitted
> > > packets.
> > > Driver to identify PTP TX skbs and steer them to these dedicated
> > > SQs
> > > as part of the select queue ndo.
> > > 
> > > Driver to hold ptpsq per TC and report them at
> > > netif_set_real_num_tx_queues().
> > > 
> > > Add support for all needed functionality in order to xmit and poll
> > > completions received via ptpsq.
> > > 
> > > Add ptpsq to the TX reporter recover, diagnose and dump methods.
> > > 
> > > Creation of ptpsqs is disabled by default, and can be enabled via
> > > tx_port_ts private flag.  
> > 
> > This flag is pretty bad user experience.
> 
> Yeah, nothing i  could do about this, there is a large memory foot
> print i want to avoid, and we don't want to complicate PTP ctrl API of
> the HW operating mode, so until we improve the HW, we prefer to keep
> this feature as a private flag.
> 
> > > This patch steer all timestamp related packets to a ptpsq, but it
> > > does not open the port timestamp support for it. The support will
> > > be added in the following patch.  
> > 
> > Overall I'm a little shocked by this, let me sleep on it :)
> > 
> > More info on the trade offs and considerations which led to the
> > implementation would be useful.  
> 
> To get the Improved accuracy we need a special type of SQs attached to
> special HW objects that will provide more accurate stamping.
> 
> Trade-offs are :
> 
> options 1) convert ALL regular txqs (SQs) to work in this port stamping
> mode.
> 
> Pros: no need for any special mode in driver, no additional memory,
> other than the new HW objects we create for the special stamping.
> 
> Cons: significant performance hit for non PTP traffic, (the hw stamps
> all packets in the slow but more accurate mode)

Just to be clear (Alexei brought this up when I mentioned these
patches) - the requirement for the separate queues is because the time
stamp enable is a queue property, not a per WQE / frame thing? I
couldn't find this in the code - could you point me to where it's set?

> option 2) route PTP traffic to a special SQs per ring, this SQ will be
> PTP port accurate, Normal traffic will continue through regular SQs
> 
> Pros: Regular non PTP traffic not affected.
> Cons: High memory footprint for creating special SQs
> 
> 
> So we prefer (2) + private flag to avoid the performance hit and the
> redundant memory usage out of the box.

Option 3 - have only one special PTP queue in the system. PTP traffic
is rather low rate, queue per core doesn't seem necessary.


Since you said the PTP queues are slower / higher overhead - are you not
concerned that QUIC traffic will get mis-directed to them? People like
hardware time stamps for all sort of measurements these days. Plus,
since UDP doesn't itself set ooo those applications may be surprised to
see increased out-of-order rate.

Why not use the PTP classification helpers we already have?
Saeed Mahameed Dec. 4, 2020, 9:57 p.m. UTC | #4
On Fri, 2020-12-04 at 12:26 -0800, Jakub Kicinski wrote:
> On Fri, 04 Dec 2020 11:33:26 -0800 Saeed Mahameed wrote:
> > On Thu, 2020-12-03 at 18:29 -0800, Jakub Kicinski wrote:
> > > On Wed, 2 Dec 2020 20:21:01 -0800 Saeed Mahameed wrote:  
> > > > Add TX PTP port object support for better TX timestamping
> > > > accuracy.
> > > > Currently, driver supports CQE based TX port timestamp. Device
> > > > also offers TX port timestamp, which has less jitter and better
> > > > reflects the actual time of a packet's transmit.  
> > > 
> > > How much better is it?
> > > 
> > > Is the new implementation is standard compliant or just a "better
> > > guess"?
> > 
> > It is not a guess for sure, the closer to the output port you take
> > the
> > stamp the more accurate you get, this is why we need the HW
> > timestamp
> > in first place, i don't have the exact number though, but we target
> > to
> > be compliant with G.8273.2 class C, (30 nsec), and this code allow
> > Linux systems to be deployed in the 5G telco edge. Where this
> > standard
> > is needed.
> 
> I see. IIRC there was also an IEEE standard which specified the exact
> time stamping point (i.e. SFD crosses layer X). If it's class C that
> answers the question, I think.
> 
> > > > Define new driver layout called ptpsq, on which driver will
> > > > create
> > > > SQs that will support TX port timestamp for their transmitted
> > > > packets.
> > > > Driver to identify PTP TX skbs and steer them to these
> > > > dedicated
> > > > SQs
> > > > as part of the select queue ndo.
> > > > 
> > > > Driver to hold ptpsq per TC and report them at
> > > > netif_set_real_num_tx_queues().
> > > > 
> > > > Add support for all needed functionality in order to xmit and
> > > > poll
> > > > completions received via ptpsq.
> > > > 
> > > > Add ptpsq to the TX reporter recover, diagnose and dump
> > > > methods.
> > > > 
> > > > Creation of ptpsqs is disabled by default, and can be enabled
> > > > via
> > > > tx_port_ts private flag.  
> > > 
> > > This flag is pretty bad user experience.
> > 
> > Yeah, nothing i  could do about this, there is a large memory foot
> > print i want to avoid, and we don't want to complicate PTP ctrl API
> > of
> > the HW operating mode, so until we improve the HW, we prefer to
> > keep
> > this feature as a private flag.
> > 
> > > > This patch steer all timestamp related packets to a ptpsq, but
> > > > it
> > > > does not open the port timestamp support for it. The support
> > > > will
> > > > be added in the following patch.  
> > > 
> > > Overall I'm a little shocked by this, let me sleep on it :)
> > > 
> > > More info on the trade offs and considerations which led to the
> > > implementation would be useful.  
> > 
> > To get the Improved accuracy we need a special type of SQs attached
> > to
> > special HW objects that will provide more accurate stamping.
> > 
> > Trade-offs are :
> > 
> > options 1) convert ALL regular txqs (SQs) to work in this port
> > stamping
> > mode.
> > 
> > Pros: no need for any special mode in driver, no additional memory,
> > other than the new HW objects we create for the special stamping.
> > 
> > Cons: significant performance hit for non PTP traffic, (the hw
> > stamps
> > all packets in the slow but more accurate mode)
> 
> Just to be clear (Alexei brought this up when I mentioned these
> patches) - the requirement for the separate queues is because the
> time
> stamp enable is a queue property, not a per WQE / frame thing? I
> couldn't find this in the code - could you point me to where it's
> set?
> 

Yes, it is not per WQE; it is a new SQ property. We set it in
mlx5e_ptp_open_txqsq() and then pass it to mlx5e_create_sq(),

where we set it in the HW context like so:

MLX5_SET(sqc,  sqc, ts_cqe_to_dest_cqn, csp->ts_cqe_to_dest_cqn);

A nice quirk! This will be line #1234 in mlx5/core/en_main.c :)


> > option 2) route PTP traffic to a special SQs per ring, this SQ will
> > be
> > PTP port accurate, Normal traffic will continue through regular SQs
> > 
> > Pros: Regular non PTP traffic not affected.
> > Cons: High memory footprint for creating special SQs
> > 
> > 
> > So we prefer (2) + private flag to avoid the performance hit and
> > the
> > redundant memory usage out of the box.
> 
> Option 3 - have only one special PTP queue in the system. PTP traffic
> is rather low rate, queue per core doesn't seem necessary.
> 

We only forward PTP traffic to the new special queue, but we create more
than one to avoid internal locking, as we will utilize the tx softirq
percpu.

After double checking the code, it seems Eran and Tariq have decided to
forward all UDP traffic. Let me double check with them what happened
here.


> 
> Since you said the PTP queues are slower / higher overhead - are you
> not
> concerned that QUIC traffic will get mis-directed to them? People
> like
> hardware time stamps for all sort of measurements these days. Plus,
> since UDP doesn't itself set ooo those applications may be surprised
> to
> see increased out-of-order rate.
> 

Right, I thought Eran was looking for the PTP UDP port as well.
Let me verify what happened here.

> Why not use the PTP classification helpers we already have?

Do you mean ptp_parse_header() or the eBPF prog?
We use skb_flow_dissect(), which should be simple enough.
Jakub Kicinski Dec. 4, 2020, 10:52 p.m. UTC | #5
On Fri, 04 Dec 2020 13:57:49 -0800 Saeed Mahameed wrote:
> > Why not use the PTP classification helpers we already have?  
> 
> do you mean ptp_parse_header() or the ebpf prog ?
> We use skb_flow_dissect() which should be simple enough.

Not sure which exact one TBH, I just know we have helpers for this, 
so if we don't use them it'd be good to at least justify why.

Maybe someone with more practical knowledge here can chime in with 
a recommendation for a helper to find PTP frames on TX?
Jakub Kicinski Dec. 4, 2020, 11:17 p.m. UTC | #6
On Fri, 04 Dec 2020 13:57:49 -0800 Saeed Mahameed wrote:
> > > option 2) route PTP traffic to a special SQs per ring, this SQ will
> > > be
> > > PTP port accurate, Normal traffic will continue through regular SQs
> > > 
> > > Pros: Regular non PTP traffic not affected.
> > > Cons: High memory footprint for creating special SQs
> > > 
> > > So we prefer (2) + private flag to avoid the performance hit and
> > > the
> > > redundant memory usage out of the box.  
> > 
> > Option 3 - have only one special PTP queue in the system. PTP traffic
> > is rather low rate, queue per core doesn't seem necessary.
> 
> We only forward ptp traffic to the new special queue but we create more
> than one to avoid internal locking as we will utilize the tx softirq
> percpu.

In other words to make the driver implementation simpler we'll have
a pretty basic feature hidden behind a ethtool priv knob and a number
of queues which doesn't match reality reported to user space. Hm.
Saeed Mahameed Dec. 4, 2020, 11:57 p.m. UTC | #7
On Fri, 2020-12-04 at 15:17 -0800, Jakub Kicinski wrote:
> On Fri, 04 Dec 2020 13:57:49 -0800 Saeed Mahameed wrote:
> > > > option 2) route PTP traffic to a special SQs per ring, this SQ
> > > > will
> > > > be
> > > > PTP port accurate, Normal traffic will continue through regular
> > > > SQs
> > > > 
> > > > Pros: Regular non PTP traffic not affected.
> > > > Cons: High memory footprint for creating special SQs
> > > > 
> > > > So we prefer (2) + private flag to avoid the performance hit
> > > > and
> > > > the
> > > > redundant memory usage out of the box.  
> > > 
> > > Option 3 - have only one special PTP queue in the system. PTP
> > > traffic
> > > is rather low rate, queue per core doesn't seem necessary.
> > 
> > We only forward ptp traffic to the new special queue but we create
> > more
> > than one to avoid internal locking as we will utilize the tx
> > softirq
> > percpu.
> 
> In other words to make the driver implementation simpler we'll have
> a pretty basic feature hidden behind a ethtool priv knob and a number
> of queues which doesn't match reality reported to user space. Hm.

I look at these queues as special HW objects that allow the accurate
PTP stamping. They piggyback on the reported txqs, so they are
transparent; they just increase the memory footprint of each ring.

For the priv flags, one of the floating ideas was to
use the hwtstamp_rx_filters flags:
 
https://elixir.bootlin.com/linux/latest/source/include/uapi/linux/net_tstamp.h#L107

Our hardware timestamps all packets for free whether you request it or
not. Currently there is no option to set up "ALL_PTP" traffic in ethtool
-T, but we can add this flag, as it makes sense for it to be in ethtool
-T. We could then use it in mlx5 to determine whether the user selected
ALL_PTP, in which case PTP packets will go through this accurate special
path.

This is not a W/A or an abuse of the new flag; it just means that if you
select ALL_PTP, a side effect will be that our HW is more accurate
for PTP traffic.

What do you think?

Regarding reducing to a single special queue, I will discuss with Eran
and the team on Sunday.

Thanks,
Saeed.
Jakub Kicinski Dec. 5, 2020, 12:24 a.m. UTC | #8
On Fri, 04 Dec 2020 15:57:36 -0800 Saeed Mahameed wrote:
> On Fri, 2020-12-04 at 15:17 -0800, Jakub Kicinski wrote:
> > On Fri, 04 Dec 2020 13:57:49 -0800 Saeed Mahameed wrote:  
> > > > > option 2) route PTP traffic to a special SQs per ring, this SQ
> > > > > will
> > > > > be
> > > > > PTP port accurate, Normal traffic will continue through regular
> > > > > SQs
> > > > > 
> > > > > Pros: Regular non PTP traffic not affected.
> > > > > Cons: High memory footprint for creating special SQs
> > > > > 
> > > > > So we prefer (2) + private flag to avoid the performance hit
> > > > > and
> > > > > the
> > > > > redundant memory usage out of the box.    
> > > > 
> > > > Option 3 - have only one special PTP queue in the system. PTP
> > > > traffic
> > > > is rather low rate, queue per core doesn't seem necessary.  
> > > 
> > > We only forward ptp traffic to the new special queue but we create
> > > more
> > > than one to avoid internal locking as we will utilize the tx
> > > softirq
> > > percpu.  
> > 
> > In other words to make the driver implementation simpler we'll have
> > a pretty basic feature hidden behind a ethtool priv knob and a number
> > of queues which doesn't match reality reported to user space. Hm.  
> 
> I look at these queues as a special HW objects to allow the accurate
> PTP stamping, they piggyback on the reported txqs, so they are
> transparent, 

But they are visible to the stack, via sysfs, netlink. Any check
in the kernel that tries to help the driver by validating user input
against real_num_tx_queues will be moot for mlx5e.

mlx5e hides the AF_XDP queues behind normal RSS queues, but it would
have extra visible queues for TX PTP.

> they just increase the memory footprint of each ring.

For every ring or for every TC? (which is hopefully 1 in any non-DCB
deployment?)

> for the priv flags, one of the floating ideas was to
> use hwtstamp_rx_filters flags:
>  
> https://elixir.bootlin.com/linux/latest/source/include/uapi/linux/net_tstamp.h#L107
> 
> Our hardware timestamps all packets for free whether you request it or
> not, Currently there is no option to setup "ALL_PTP" traffic in ethtool
> -T, but we can add this flag as it make sense to be in ethtool -T, thus
> we could use it in mlx5 to determine if user selected ALL_PTP, then ptp
> packets will go through this accurate special path.
> 
> This is not a W/A or an abuse to the new flag, it just means if you
> select ALL_PTP then a side effect will be our HW will be more accurate 
> for PTP traffic.
> 
> What do you think ?

That sounds much better than the priv flag, yes.

> Regarding reducing to a single special queue, i will discuss with Eran
> and the Team on Sunday.

Okay, thanks.
Vladimir Oltean Dec. 5, 2020, 12:55 a.m. UTC | #9
Hi Jakub,

On Fri, Dec 04, 2020 at 02:52:40PM -0800, Jakub Kicinski wrote:
> On Fri, 04 Dec 2020 13:57:49 -0800 Saeed Mahameed wrote:
> > > Why not use the PTP classification helpers we already have?
> >
> > do you mean ptp_parse_header() or the ebpf prog ?
> > We use skb_flow_dissect() which should be simple enough.
>
> Not sure which exact one TBH, I just know we have helpers for this,
> so if we don't use them it'd be good to at least justify why.
>
> Maybe someone with more practical knowledge here can chime in with
> a recommendation for a helper to find PTP frames on TX?

ptp_classify_raw is optimized to identify PTP event messages (the only
ones that need to be timestamped as far as the protocol is concerned).
PTP general messages (Follow-Up, Delay_Resp, Announce etc) will return
PTP_CLASS_NONE from ptp_classify_raw.

But maybe there is an even better way, since this is on the TX path,
maybe the .ndo_select_queue operation can simply look at
	skb_shinfo(skb)->tx_flags & SKBTX_HW_TSTAMP
when deciding whether to send it to the "good" queue or not. This has
the advantage of being less expensive than any sort of frame classification.
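
A minimal userspace sketch of this flag-based steering, not the mlx5e
code: the struct, the flag constant and the queue layout (one PTP queue
per TC, placed after all regular queues) are stand-ins assumed for
illustration. In the kernel the bit is SKBTX_HW_TSTAMP in
skb_shinfo(skb)->tx_flags.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for the kernel's SKBTX_HW_TSTAMP tx_flags bit (model only). */
#define MODEL_SKBTX_HW_TSTAMP (1u << 0)

/* Hypothetical, simplified view of an skb: only the TX flags matter here. */
struct model_skb {
	uint8_t tx_flags;
};

/* Assumed layout: num_channels * num_tc regular queues, then one PTP queue per TC. */
struct model_txq_layout {
	uint16_t num_channels;
	uint16_t num_tc;
};

/*
 * Return the PTP queue of the given TC when the skb asks for a HW TX
 * timestamp; otherwise keep the regular queue the stack already picked.
 */
static uint16_t model_select_queue(const struct model_skb *skb,
				   const struct model_txq_layout *l,
				   uint16_t regular_txq, uint16_t tc)
{
	assert(tc < l->num_tc);
	assert(regular_txq < l->num_channels * l->num_tc);

	if (skb->tx_flags & MODEL_SKBTX_HW_TSTAMP)
		return (uint16_t)(l->num_channels * l->num_tc + tc);

	return regular_txq;
}
```

With 4 channels and 2 TCs in this model, a flagged skb on TC 1 lands on
queue 9 and an unflagged one stays on its regular queue. Note that the
check is cheap but steers every skb that requests a HW TX timestamp,
not only PTP frames.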

Nonetheless, some tests would need to be run. In theory, practice and
theory are the same, whereas in practice they aren't.
Vladimir Oltean Dec. 5, 2020, 1:49 a.m. UTC | #10
On Fri, Dec 04, 2020 at 12:26:13PM -0800, Jakub Kicinski wrote:
> On Fri, 04 Dec 2020 11:33:26 -0800 Saeed Mahameed wrote:
> > On Thu, 2020-12-03 at 18:29 -0800, Jakub Kicinski wrote:
> > > On Wed, 2 Dec 2020 20:21:01 -0800 Saeed Mahameed wrote:
> > > > Add TX PTP port object support for better TX timestamping accuracy.
> > > > Currently, driver supports CQE based TX port timestamp. Device
> > > > also offers TX port timestamp, which has less jitter and better
> > > > reflects the actual time of a packet's transmit.
> > >
> > > How much better is it?
> > >
> > > Is the new implementation is standard compliant or just a "better
> > > guess"?
> >
> > It is not a guess for sure, the closer to the output port you take the
> > stamp the more accurate you get, this is why we need the HW timestamp
> > in first place, i don't have the exact number though, but we target to
> > be compliant with G.8273.2 class C, (30 nsec), and this code allow
> > Linux systems to be deployed in the 5G telco edge. Where this standard
> > is needed.
>
> I see. IIRC there was also an IEEE standard which specified the exact
> time stamping point (i.e. SFD crosses layer X). If it's class C that
> answers the question, I think.

The ITU-T G.8273.2 specification just requires a Class C clock to have a
maximum absolute time error under steady state of 30 ns. And taking
timestamps closer to the wire eliminates some clock domain crossings
from what is measured in the path delay, this is probably the reason why
timestamping is more accurate, and it helps to achieve the required
jitter figure.

The IEEE standard that you're thinking of is clause "7.3.4 Generation of
event message timestamps" of IEEE 1588.

-----------------------------[cut here]-----------------------------
7.3.4.1 Event message timestamp point

Unless otherwise specified in a transport-specific annex to this
standard, the message timestamp point for an event message shall be the
beginning of the first symbol after the Start of Frame (SOF) delimiter.

7.3.4.2 Event timestamp generation

All PTP event messages are timestamped on egress and ingress. The
timestamp shall be the time at which the event message timestamp point
passes the reference plane marking the boundary between the PTP node and
the network.

NOTE 1— If an implementation generates event message timestamps using a
point other than the message timestamp point, then the generated
timestamps should be appropriately corrected by the time interval
between the actual time of detection and the time the message timestamp
point passed the reference plane. Failure to make these corrections
results in a time offset between the slave and master clocks.
-----------------------------[cut here]-----------------------------

So there you go, it just says "the reference plane marking the boundary
between the PTP node and the network". So it depends on how you draw the
borders. I cannot seem to find any more precise definition.

Regardless of the layer at which the timestamp is taken, it is the
jitter that matters more than the reduced path delay. The latter is just
a side effect.
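
To illustrate the jitter point, here is a small sketch of the standard
delay request-response arithmetic (offset and mean path delay from the
four timestamps t1..t4). The struct and function names are made up for
this example, and the numbers in the note below are invented.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * t1: master TX of Sync, t2: slave RX of Sync,
 * t3: slave TX of Delay_Req, t4: master RX of Delay_Req (all in ns).
 */
struct ptp_exchange {
	int64_t t1, t2, t3, t4;
};

/* Estimated offset of the slave clock from the master clock. */
static int64_t ptp_offset_ns(const struct ptp_exchange *x)
{
	assert(x != NULL);
	return ((x->t2 - x->t1) - (x->t4 - x->t3)) / 2;
}

/* Estimated one-way path delay, assuming a symmetric path. */
static int64_t ptp_mean_path_delay_ns(const struct ptp_exchange *x)
{
	assert(x != NULL);
	return ((x->t2 - x->t1) + (x->t4 - x->t3)) / 2;
}
```

For a true offset of 100 ns and a 500 ns path, {t1, t2, t3, t4} =
{1000, 1600, 2000, 2400} gives offset 100 and delay 500. If both TX
stamps are taken a constant 20 ns early ({980, 1600, 1980, 2400}), the
delay reads 520 but the offset is still 100. If only one stamp is off
by 20 ns ({980, 1600, 2000, 2400}), the offset reads 110: an error that
varies from packet to packet lands directly in the offset estimate.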

"How much better" is an interesting question though.
Jakub Kicinski Dec. 5, 2020, 2:10 a.m. UTC | #11
On Sat, 5 Dec 2020 03:49:27 +0200 Vladimir Oltean wrote:
> So there you go, it just says "the reference plane marking the boundary
> between the PTP node and the network". So it depends on how you draw the
> borders. I cannot seem to find any more precise definition.

Ah, you made me go search :)

I was referring to what's now section 90 of IEEE 802.3-2018.

> Regardless of the layer at which the timestamp is taken, it is the
> jitter that matters more than the reduced path delay. The latter is just
> a side effect.
Richard Cochran Dec. 5, 2020, 1:20 p.m. UTC | #12
On Sat, Dec 05, 2020 at 03:49:27AM +0200, Vladimir Oltean wrote:
> So there you go, it just says "the reference plane marking the boundary
> between the PTP node and the network". So it depends on how you draw the
> borders.

It depends on the physical link technology.  You can't just "draw the
borders" anywhere you like!

Thanks,
Richard
Eran Ben Elisha Dec. 6, 2020, 1:33 p.m. UTC | #13
On 12/4/2020 11:57 PM, Saeed Mahameed wrote:
> We only forward ptp traffic to the new special queue but we create more
> than one to avoid internal locking as we will utilize the tx softirq
> percpu.
> 
> After double checking the code it seems Eran and Tariq have decided to
> forward all UDP traffic, let me double check with them what happened
> here.

We thought about extending the support of these queues to UDP in general,
and not just PTP. But we can roll this back to PTP time-critical events
on dport 319 only.
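
A rough sketch of what a "dport 319 only" check amounts to, written as
a standalone userspace function over raw UDP header bytes. The function
and macro names are hypothetical, and this is not how the driver parses
the skb; it only shows the comparison against the PTP event port.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_PTP_EVENT_UDP_PORT 319u /* PTP event messages; general messages use 320 */
#define MODEL_UDP_HDR_LEN        8u

/* udp points at the start of the UDP header; fields are big-endian on the wire. */
static bool model_is_ptp_event_udp(const uint8_t *udp, size_t len)
{
	uint16_t dport;

	assert(udp != NULL);
	if (len < MODEL_UDP_HDR_LEN)
		return false;

	/* Destination port is the second 16-bit field of the UDP header. */
	dport = (uint16_t)((udp[2] << 8) | udp[3]);
	return dport == MODEL_PTP_EVENT_UDP_PORT;
}
```

A header with destination port 319 (0x013F) matches, while port 320
(general messages such as Follow_Up or Announce) and any other UDP
traffic, QUIC included, does not.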
Eran Ben Elisha Dec. 6, 2020, 1:36 p.m. UTC | #14
On 12/5/2020 1:17 AM, Jakub Kicinski wrote:
>> We only forward ptp traffic to the new special queue but we create more
>> than one to avoid internal locking as we will utilize the tx softirq
>> percpu.
> In other words to make the driver implementation simpler we'll have
> a pretty basic feature hidden behind a ethtool priv knob and a number
> of queues which doesn't match reality reported to user space. Hm.

We are not hiding these queues from the netdev stack. We report them in
the real number of TX queues and manage them like any other queue. The only
change is that select_queue() will steer a stream to them if and only
if it matches specific criteria.
Eran Ben Elisha Dec. 6, 2020, 1:37 p.m. UTC | #15
On 12/5/2020 2:24 AM, Jakub Kicinski wrote:
> On Fri, 04 Dec 2020 15:57:36 -0800 Saeed Mahameed wrote:
>> On Fri, 2020-12-04 at 15:17 -0800, Jakub Kicinski wrote:
>>> On Fri, 04 Dec 2020 13:57:49 -0800 Saeed Mahameed wrote:
>>>>>> option 2) route PTP traffic to a special SQs per ring, this SQ
>>>>>> will
>>>>>> be
>>>>>> PTP port accurate, Normal traffic will continue through regular
>>>>>> SQs
>>>>>>
>>>>>> Pros: Regular non PTP traffic not affected.
>>>>>> Cons: High memory footprint for creating special SQs
>>>>>>
>>>>>> So we prefer (2) + private flag to avoid the performance hit
>>>>>> and
>>>>>> the
>>>>>> redundant memory usage out of the box.
>>>>>
>>>>> Option 3 - have only one special PTP queue in the system. PTP
>>>>> traffic
>>>>> is rather low rate, queue per core doesn't seem necessary.
>>>>
>>>> We only forward ptp traffic to the new special queue but we create
>>>> more
>>>> than one to avoid internal locking as we will utilize the tx
>>>> softirq
>>>> percpu.
>>>
>>> In other words to make the driver implementation simpler we'll have
>>> a pretty basic feature hidden behind a ethtool priv knob and a number
>>> of queues which doesn't match reality reported to user space. Hm.
>>
>> I look at these queues as a special HW objects to allow the accurate
>> PTP stamping, they piggyback on the reported txqs, so they are
>> transparent,
> 
> But they are visible to the stack, via sysfs, netlink. Any check
> in the kernel that tries to help the driver by validating user input
> against real_num_tx_queues will be moot for mlx5e.

Repeating it here: we report them in the real number of TX queues.

> 
> mlx5e hides the AF_XDP queues behind normal RSS queues, but it would
> have extra visible queues for TX PTP.
> 
>> they just increase the memory footprint of each ring.
> 
> For every ring or for every TC? (which is hopefully 1 in any non-DCB
> deployment?)

For every TC, not for every ring.

> 
>> for the priv flags, one of the floating ideas was to
>> use hwtstamp_rx_filters flags:
>>   
>> https://elixir.bootlin.com/linux/latest/source/include/uapi/linux/net_tstamp.h#L107
>>
>> Our hardware timestamps all packets for free whether you request it or
>> not, Currently there is no option to setup "ALL_PTP" traffic in ethtool
>> -T, but we can add this flag as it make sense to be in ethtool -T, thus
>> we could use it in mlx5 to determine if user selected ALL_PTP, then ptp
>> packets will go through this accurate special path.
>>
>> This is not a W/A or an abuse to the new flag, it just means if you
>> select ALL_PTP then a side effect will be our HW will be more accurate
>> for PTP traffic.
>>
>> What do you think ?
> 
> That sounds much better than the priv flag, yes.

Our hardware can provide a more accurate timestamp, subject to a few
limitations. It requires higher memory consumption ({SQ, 2 x CQ,
internal HW LB RQ} per TC), and it also has a performance impact (more
CQEs to consume, for example).
Some customers are happy with the accuracy they get today and don't want
the extra penalty, so they don't want to be automatically shifted to the
new TS logic.

Adding a new enum value to the ioctl means we have to add it
(HWTSTAMP_TX_ON_TIME_CRITICAL_ONLY, for example) all the way: drivers,
kernel ptp, user space ptp, ethtool.

My concerns are:
1. Timestamp applications (like ptp4l or similar) will have to add
support for configuring the driver to use
HWTSTAMP_TX_ON_TIME_CRITICAL_ONLY, if supported, via ioctl prior to
packet transmission. From the application's point of view, the dual-mode
(HWTSTAMP_TX_ON_TIME_CRITICAL_ONLY, HWTSTAMP_TX_ON) support is
redundant, as it offers nothing new.
2. Other vendors will have to support it as well, and it is not clear
what is expected from them if they cannot offer different accuracy
levels between the two modes.

This feature is just an internal enhancement, and as such it should be
added only as a vendor private configuration flag. We are not proposing
any standard here for others to follow.

If we did not have the limitations above, it could have been silently
added as the default.

I suggest we reconsider the ethtool private flag; the ioctl change might
be a long journey in the wrong direction.

> 
>> Regarding reducing to a single special queue, i will discuss with Eran
>> and the Team on Sunday.
> 
> Okay, thanks.
>
Richard Cochran Dec. 6, 2020, 5:08 p.m. UTC | #16
On Sun, Dec 06, 2020 at 03:37:47PM +0200, Eran Ben Elisha wrote:
> Adding new enum to the ioctl means we have add
> (HWTSTAMP_TX_ON_TIME_CRITICAL_ONLY for example) all the way - drivers,
> kernel ptp, user space ptp, ethtool.
> 
> My concerns are:
> 1. Timestamp applications (like ptp4l or similar) will have to add support
> for configuring the driver to use HWTSTAMP_TX_ON_TIME_CRITICAL_ONLY if
> supported via ioctl prior to packets transmit. From application point of
> view, the dual-modes (HWTSTAMP_TX_ON_TIME_CRITICAL_ONLY , HWTSTAMP_TX_ON)
> support is redundant, as it offers nothing new.

Well said.

> 2. Other vendors will have to support it as well, when not sure what is the
> expectation from them if they cannot improve accuracy between them.

If there were multiple different devices out there with this kind of
implementation (different levels of accuracy with increasing run time
performance cost), then we could consider such a flag.  However, to my
knowledge, this feature is unique to your device.

> This feature is just an internal enhancement, and as such it should be added
> only as a vendor private configuration flag. We are not offering here about
> any standard for others to follow.

+1

Thanks,
Richard
Saeed Mahameed Dec. 7, 2020, 5:50 a.m. UTC | #17
On Sat, 2020-12-05 at 03:49 +0200, Vladimir Oltean wrote:
> On Fri, Dec 04, 2020 at 12:26:13PM -0800, Jakub Kicinski wrote:
> > On Fri, 04 Dec 2020 11:33:26 -0800 Saeed Mahameed wrote:
> > > On Thu, 2020-12-03 at 18:29 -0800, Jakub Kicinski wrote:
> > > > On Wed, 2 Dec 2020 20:21:01 -0800 Saeed Mahameed wrote:
> > > > > > Add TX PTP port object support for better TX timestamping
> > > > > > accuracy. Currently, driver supports CQE based TX port timestamp.
> > > > > > Device also offers TX port timestamp, which has less jitter and
> > > > > > better reflects the actual time of a packet's transmit.
> > > > > 
> > > > > How much better is it?
> > > > > 
> > > > > Is the new implementation is standard compliant or just a "better
> > > > > guess"?
> > > > 
> > > > It is not a guess for sure, the closer to the output port you take the
> > > > stamp the more accurate you get, this is why we need the HW timestamp
> > > > in first place, i don't have the exact number though, but we target to
> > > > be compliant with G.8273.2 class C, (30 nsec), and this code allow
> > > > Linux systems to be deployed in the 5G telco edge. Where this standard
> > > > is needed.
> > 
> > I see. IIRC there was also an IEEE standard which specified the exact
> > time stamping point (i.e. SFD crosses layer X). If it's class C that
> > answers the question, I think.
> 
> The ITU-T G.8273.2 specification just requires a Class C clock to have a
> maximum absolute time error under steady state of 30 ns. And taking
> timestamps closer to the wire eliminates some clock domain crossings
> from what is measured in the path delay, this is probably the reason why
> timestamping is more accurate, and it helps to achieve the required
> jitter figure.
> 
> The IEEE standard that you're thinking of is clause "7.3.4 Generation of
> event message timestamps" of IEEE 1588.
> 
> -----------------------------[cut here]-----------------------------
> 7.3.4.1 Event message timestamp point
> 
> Unless otherwise specified in a transport-specific annex to this
> standard, the message timestamp point for an event message shall be
> the
> beginning of the first symbol after the Start of Frame (SOF)
> delimiter.
> 
> 7.3.4.2 Event timestamp generation
> 
> All PTP event messages are timestamped on egress and ingress. The
> timestamp shall be the time at which the event message timestamp
> point
> passes the reference plane marking the boundary between the PTP node
> and
> the network.
> 
> NOTE 1— If an implementation generates event message timestamps using
> a
> point other than the message timestamp point, then the generated
> timestamps should be appropriately corrected by the time interval
> between the actual time of detection and the time the message
> timestamp
> point passed the reference plane. Failure to make these corrections
> results in a time offset between the slave and master clocks.
> -----------------------------[cut here]-----------------------------
> 
> So there you go, it just says "the reference plane marking the
> boundary
> between the PTP node and the network". So it depends on how you draw
> the
> borders. I cannot seem to find any more precise definition.
> 
> Regardless of the layer at which the timestamp is taken, it is the
> jitter that matters more than the reduced path delay. The latter is
> just
> a side effect.
> 

So the closer to the wire you take the stamp, the less potential there
is for jitter, since this is after all the HW pipeline's variable delays.
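
To make NOTE 1 of the IEEE 1588 excerpt quoted above concrete, here is a
minimal sketch of how a timestamp taken away from the reference plane
could be corrected. It is not code from this patch: the function names
and the assumption of a fixed, known latency per direction are
hypothetical and only serve the example.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helpers, not part of the mlx5 driver. The latency values
 * are assumed to be known constants (from calibration or a datasheet).
 */

/*
 * Egress: the stamp is captured before the frame reaches the reference
 * plane, so the real crossing time is later by the egress latency.
 */
int64_t correct_egress_ts(int64_t raw_ts_ns, int64_t egress_latency_ns)
{
	assert(egress_latency_ns >= 0);
	return raw_ts_ns + egress_latency_ns;
}

/*
 * Ingress: the stamp is captured after the frame crossed the reference
 * plane, so the real crossing time is earlier by the ingress latency.
 */
int64_t correct_ingress_ts(int64_t raw_ts_ns, int64_t ingress_latency_ns)
{
	assert(ingress_latency_ns >= 0);
	return raw_ts_ns - ingress_latency_ns;
}
```

Such a correction only removes the constant part of the delay. Any
variable delay between the stamping point and the wire remains as
jitter, which is why stamping closer to the port helps.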

> "How much better" is an interesting question though.
Saeed Mahameed Dec. 7, 2020, 6:22 a.m. UTC | #18
On Sat, 2020-12-05 at 00:55 +0000, Vladimir Oltean wrote:
> Hi Jakub,
> 
> On Fri, Dec 04, 2020 at 02:52:40PM -0800, Jakub Kicinski wrote:
> > On Fri, 04 Dec 2020 13:57:49 -0800 Saeed Mahameed wrote:
> > > > Why not use the PTP classification helpers we already have?
> > > 
> > > do you mean ptp_parse_header() or the ebpf prog ?
> > > We use skb_flow_dissect() which should be simple enough.
> > 
> > Not sure which exact one TBH, I just know we have helpers for this,
> > so if we don't use them it'd be good to at least justify why.
> > 
> > Maybe someone with more practical knowledge here can chime in with
> > a recommendation for a helper to find PTP frames on TX?
> 
> ptp_classify_raw is optimized to identify PTP event messages (the
> only
> ones that need to be timestamped as far as the protocol is
> concerned).
> PTP general messages (Follow-Up, Delay_Resp, Announce etc) will
> return
> PTP_CLASS_NONE from ptp_classify_raw.
> 

I looked at the implementation. While it is nice to see that it is
running an ebpf program, it seems these functions are meant for
those who care about the content of those PTP messages.

Select queue has to be consistent for a specific stream, so
I'd rather look up the well-known PTP port via the standard flow
dissector and select the queue accordingly; using any other mechanism
might cause inconsistencies and out-of-order delivery.

Also, the flow dissector handles non-linear skbs very nicely, whereas
the two PTP classifier methods don't. They actually have different
purposes than what we are looking for.

So I think we should stick with our simple flow dissector
implementation.
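
As an illustration of the port-based lookup described above, here is a
minimal userspace sketch. It is not the driver code: the struct and
function names are hypothetical stand-ins for the keys a flow dissector
would extract. The constants are the well-known ones: EtherType 0x88F7
for PTP over Ethernet and UDP port 319 for PTP event messages (port 320
carries general messages, which need no timestamp).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SK_ETHERTYPE_PTP   0x88F7 /* PTP over IEEE 802.3 */
#define SK_IPPROTO_UDP     17
#define SK_PTP_EVENT_PORT  319    /* UDP port for PTP event messages */

/* Hypothetical stand-in for dissected flow keys (host byte order). */
struct flow_keys_sketch {
	uint16_t ethertype;
	uint8_t  ip_proto;
	uint16_t udp_dport;
};

/*
 * Decide from L2/L4 keys alone whether a flow should be steered to the
 * dedicated PTP queue. The result depends only on per-flow fields, so
 * every packet of a stream maps to the same queue.
 */
bool is_ptp_flow(const struct flow_keys_sketch *k)
{
	assert(k != NULL);

	if (k->ethertype == SK_ETHERTYPE_PTP)
		return true;

	return k->ip_proto == SK_IPPROTO_UDP &&
	       k->udp_dport == SK_PTP_EVENT_PORT;
}
```

Because the decision never inspects the PTP message body, it stays
stable for a given stream, which is the consistency property argued for
above.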

> But maybe there is an even better way, since this is on the TX path,
> maybe the .ndo_select_queue operation can simply look at
> 	skb_shinfo(skb)->tx_flags & SKBTX_HW_TSTAMP
> when deciding whether to send it to the "good" queue or not. This has
> the advantage of being less expensive than any sort of frame
> classification.
> 

We also considered this. It is bad in our case because it would
easily break performance for users who do setsockopt(SO_TIMESTAMPING)
on TCP/UDP sockets and who favor performance over precision but still
want HW timestamping.
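
For context, this is roughly what such a user does. The sketch below
uses the standard Linux SO_TIMESTAMPING socket option; the helper names
are made up for the example. A socket configured this way has its
transmitted skbs marked for hardware timestamping, so steering purely on
SKBTX_HW_TSTAMP would move all of its traffic, not only PTP, to the
special queue.

```c
#include <assert.h>
#include <linux/net_tstamp.h>
#include <sys/socket.h>

/* Flags requesting raw hardware TX timestamps (helper name is hypothetical). */
unsigned int hw_tx_timestamping_flags(void)
{
	return SOF_TIMESTAMPING_TX_HARDWARE | SOF_TIMESTAMPING_RAW_HARDWARE;
}

/*
 * Enable hardware TX timestamping on an already opened socket.
 * Returns 0 on success, -1 on error with errno set, like setsockopt().
 */
int enable_hw_tx_timestamping(int fd)
{
	unsigned int flags = hw_tx_timestamping_flags();

	assert(fd >= 0);
	return setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING,
			  &flags, sizeof(flags));
}
```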

> Nonetheless, some tests would need to be run. In theory, practice and
> theory are the same, whereas in practice they aren't.

In Theory, I don't agree ;-).
Saeed Mahameed Dec. 7, 2020, 8:37 a.m. UTC | #19
On Sun, 2020-12-06 at 09:08 -0800, Richard Cochran wrote:
> On Sun, Dec 06, 2020 at 03:37:47PM +0200, Eran Ben Elisha wrote:
> > Adding new enum to the ioctl means we have add
> > (HWTSTAMP_TX_ON_TIME_CRITICAL_ONLY for example) all the way -
> > drivers,
> > kernel ptp, user space ptp, ethtool.
> > 

Not exactly:
1) The flag name should be HWTSTAMP_TX_PTP_EVENTS, similar to what we
already have in RX, which will mean:
HW stamp all PTP events, don't care about the rest.

2) There is no need to add it to drivers from the get-go; only drivers
that are interested may implement it, and I am sure there are tons that
would like to have this flag if their HW timestamping implementation is
slow! Other drivers will just keep doing what they are doing, timestamping
all traffic even if the user requested this flag, again exactly like many
other drivers do for RX flags (hwtstamp_rx_filters).

> > My concerns are:
> > 1. Timestamp applications (like ptp4l or similar) will have to add
> > support
> > for configuring the driver to use HWTSTAMP_TX_ON_TIME_CRITICAL_ONLY
> > if
> > supported via ioctl prior to packets transmit. From application
> > point of
> > view, the dual-modes (HWTSTAMP_TX_ON_TIME_CRITICAL_ONLY ,
> > HWTSTAMP_TX_ON)
> > support is redundant, as it offers nothing new.
> 
> Well said.
> 

Disagree, it is not a dual mode; it just allows the user to have better
granularity for what the HW stamps, exactly like what we have in RX.

We are not adding any new mechanism.

> > 2. Other vendors will have to support it as well, when not sure
> > what is the
> > expectation from them if they cannot improve accuracy between them.
> 
> If there were multiple different devices out there with this kind of
> implementation (different levels of accuracy with increasing run time
> performance cost), then we could consider such a flag.  However, to
> my
> knowledge, this feature is unique to your device.
> 

I agree, but I never meant to have a flag that indicates two different
levels of accuracy; that would be a very wild mistake for sure!

The new flag will be about selecting the granularity of what gets a HW
stamp and what doesn't, aligning with the RX filter API.

> > This feature is just an internal enhancement, and as such it should
> > be added
> > only as a vendor private configuration flag. We are not offering
> > here about
> > any standard for others to follow.
> 
> +1
> 

Our driver feature is an internal enhancement, yes, but the suggested
flag is very far from indicating any internal enhancement; it is actually
an enhancement to the current API, and is a very simple extension with
a wide range of improvements to all layers.

Our driver can optimize accuracy when this flag is set; other drivers
might be happy to implement it since they already have slow HW and
this flag would allow them to run better TCP/UDP performance while
still performing PTP HW stamping; some admins/apps will use it to avoid
stamping all traffic on TX. Win win win.
Eran Ben Elisha Dec. 7, 2020, 11:05 a.m. UTC | #20
On 12/7/2020 10:37 AM, Saeed Mahameed wrote:
> On Sun, 2020-12-06 at 09:08 -0800, Richard Cochran wrote:
>> On Sun, Dec 06, 2020 at 03:37:47PM +0200, Eran Ben Elisha wrote:
>>> Adding new enum to the ioctl means we have add
>>> (HWTSTAMP_TX_ON_TIME_CRITICAL_ONLY for example) all the way -
>>> drivers,
>>> kernel ptp, user space ptp, ethtool.
>>>
> 
> Not exactly,
> 1) the flag name should be HWTSTAMP_TX_PTP_EVENTS, similar to what we
> already have in RX, which will mean:
> HW stamp all PTP events, don't care about the rest.
> 
> 2) no need to add it to drivers from the get go, only drivers who are
> interested may implement it, and i am sure there are tons who would
> like to have this flag if their hw timestamping implementation is slow
> ! other drivers will just keep doing what they are doing, timestamp all
> traffic even if user requested this flag, again exactly like many other
> drivers do for RX flags (hwtstamp_rx_filters).
> 
>>> My concerns are:
>>> 1. Timestamp applications (like ptp4l or similar) will have to add
>>> support
>>> for configuring the driver to use HWTSTAMP_TX_ON_TIME_CRITICAL_ONLY
>>> if
>>> supported via ioctl prior to packets transmit. From application
>>> point of
>>> view, the dual-modes (HWTSTAMP_TX_ON_TIME_CRITICAL_ONLY ,
>>> HWTSTAMP_TX_ON)
>>> support is redundant, as it offers nothing new.
>>
>> Well said.
>>
> 
> disagree, it is not a dual mode, just allow the user to have better
> granularity for what hw stamps, exactly like what we have in rx.
> 
> we are not adding any new mechanism.
> 
>>> 2. Other vendors will have to support it as well, when not sure
>>> what is the
>>> expectation from them if they cannot improve accuracy between them.
>>
>> If there were multiple different devices out there with this kind of
>> implementation (different levels of accuracy with increasing run time
>> performance cost), then we could consider such a flag.  However, to
>> my
>> knowledge, this feature is unique to your device.
>>
> 
> I agree, but i never meant to have a flag that indicate two different
> levels of accuracy, that would be a very wild mistake for sure!
> 
> The new flag will be about selecting granularity of what gets a hw
> stamp and what doesn't, aligning with the RX filter API.
> 
>>> This feature is just an internal enhancement, and as such it should
>>> be added
>>> only as a vendor private configuration flag. We are not offering
>>> here about
>>> any standard for others to follow.
>>
>> +1
>>
> 
> Our driver feature is and internal enhancement yes, but the suggested
> flag is very far from indicating any internal enhancement, is actually
> an enhancement to the current API, and is a very simple extension with
> wide range of improvements to all layers.
> 
> Our driver can optimize accuracy when this flag is set, other drivers
> might be happy to implement it since they already have a slow hw and
> this flag would allow them to run better TCP/UDP performance while
> still performing ptp hw stamping, some admins/apps will use it to avoid
> stamping all traffic on tx, win win win.
> 
> 
Seems interesting. I can form such V2 patches soon.
Richard Cochran Dec. 7, 2020, 3:19 p.m. UTC | #21
On Mon, Dec 07, 2020 at 12:37:45AM -0800, Saeed Mahameed wrote:

> we are not adding any new mechanism.

Sorry, I didn't catch the beginning of this thread.  Are you proposing
adding HWTSTAMP_TX_ON_TIME_CRITICAL_ONLY to net_tstamp.h ?

If so, then ...

> Our driver feature is and internal enhancement yes, but the suggested
> flag is very far from indicating any internal enhancement, is actually
> an enhancement to the current API, and is a very simple extension with
> wide range of improvements to all layers.

No, that would be no enhancement but rather a hack for poorly designed
hardware.
 
> Our driver can optimize accuracy when this flag is set, other drivers
> might be happy to implement it since they already have a slow hw

Name three other drivers that would "be happy" to implement this.  Can
you name even one other?

Thanks,
Richard
Jakub Kicinski Dec. 7, 2020, 8:29 p.m. UTC | #22
On Sun, 6 Dec 2020 15:36:38 +0200 Eran Ben Elisha wrote:
> On 12/5/2020 1:17 AM, Jakub Kicinski wrote:
> >> We only forward ptp traffic to the new special queue but we create more
> >> than one to avoid internal locking as we will utilize the tx softirq
> >> percpu.  
> > In other words to make the driver implementation simpler we'll have
> > a pretty basic feature hidden behind a ethtool priv knob and a number
> > of queues which doesn't match reality reported to user space. Hm.  
> 
> We are not hiding these queues from the netdev stack. We report them in 
> real num of TX queues and manage them as any other queue. The only 
> change is that select_queue() will select a stream to them if and only 
> if they match specific criteria.

Please read more carefully what you're replying to. That helps
communication and limits frustration quite a lot.

I said the queues are hidden behind the ethtool knob, as in they are
only instantiated when the knob is turned from its default position.
Then you report to the stack that you have n+m queues, but in fact
there are only n queues that are of general use.
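
The n+m arithmetic can be sketched as follows. This is a minimal
illustration derived from the patch's mlx5e_ptp_open_txqsqs(), where the
PTP SQs are indexed from num_tc * num_channels upward, one per TC. The
function names are hypothetical and the total is an assumption based on
the commit message, not a copy of driver code.

```c
#include <assert.h>

/* Index of the PTP txq for a given TC: placed after all regular txqs. */
int ptp_txq_index(int num_channels, int num_tc, int tc)
{
	assert(num_channels > 0 && num_tc > 0);
	assert(tc >= 0 && tc < num_tc);
	return num_channels * num_tc + tc;
}

/*
 * Number of TX queues reported to the stack:
 * n = num_channels * num_tc general-purpose queues, plus
 * m = num_tc PTP-only queues when the feature is enabled.
 */
int reported_num_tx_queues(int num_channels, int num_tc, int ptp_enabled)
{
	int n = num_channels * num_tc;

	assert(num_channels > 0 && num_tc > 0);
	return ptp_enabled ? n + num_tc : n;
}
```

With 8 channels and 2 TCs, for example, the stack would see 18 queues,
of which only the first 16 carry general traffic.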
Jakub Kicinski Dec. 7, 2020, 8:42 p.m. UTC | #23
On Mon, 7 Dec 2020 07:19:06 -0800 Richard Cochran wrote:
> On Mon, Dec 07, 2020 at 12:37:45AM -0800, Saeed Mahameed wrote:
> > we are not adding any new mechanism.  
> 
> Sorry, I didn't catch the beginning of this thread.  Are you proposing
> adding HWTSTAMP_TX_ON_TIME_CRITICAL_ONLY to net_tstamp.h ?
> 
> If so, then ...
> 
> > Our driver feature is and internal enhancement yes, but the suggested
> > flag is very far from indicating any internal enhancement, is actually
> > an enhancement to the current API, and is a very simple extension with
> > wide range of improvements to all layers.  
> 
> No, that would be no enhancement but rather a hack for poorly designed
> hardware.
> 
> > Our driver can optimize accuracy when this flag is set, other drivers
> > might be happy to implement it since they already have a slow hw  
> 
> Name three other drivers that would "be happy" to implement this.  Can
> you name even one other?

The behavior is not entirely dissimilar to the time stamps on
multi-layered devices (e.g. DSA switches). The time stamp can either 
be generated when the packet enters the device (current mlx5 behavior)
or when it actually egresses thru the MAC (what this set adds).

So while we could find other hardware like this if we squint hard enough
- I'm not sure how much practical use for CPU-side stamps there is in DSA.


My main concern is the user friendliness. I think there is no question
that user running ptp4l would want this mlx5 knob to be enabled. Would
we rather see a patch to ptp4l that turns per driver knob or should we
shoot for some form of an API that tells the kernel that we're
expecting ns level time accuracy? 

That's how I would phrase the dilemma here.
Saeed Mahameed Dec. 7, 2020, 10:04 p.m. UTC | #24
On Mon, 2020-12-07 at 12:42 -0800, Jakub Kicinski wrote:
> On Mon, 7 Dec 2020 07:19:06 -0800 Richard Cochran wrote:
> > On Mon, Dec 07, 2020 at 12:37:45AM -0800, Saeed Mahameed wrote:
> > > we are not adding any new mechanism.  
> > 
> > Sorry, I didn't catch the beginning of this thread.  Are you
> > proposing
> > adding HWTSTAMP_TX_ON_TIME_CRITICAL_ONLY to net_tstamp.h ?
> > 
> > If so, then ...
> > 
> > > Our driver feature is and internal enhancement yes, but the
> > > suggested
> > > flag is very far from indicating any internal enhancement, is
> > > actually
> > > an enhancement to the current API, and is a very simple extension
> > > with
> > > wide range of improvements to all layers.  
> > 
> > No, that would be no enhancement but rather a hack for poorly
> > designed
> > hardware.
> > 

Why? How is the new flag different from HWTSTAMP_TX_ONESTEP_SYNC?
It is a way to fine-tune the driver; nothing is hacky about the new
flag.

> > > Our driver can optimize accuracy when this flag is set, other
> > > drivers
> > > might be happy to implement it since they already have a slow
> > > hw  
> > 
> > Name three other drivers that would "be happy" to implement
> > this.  Can
> > you name even one other?
> 
> The behavior is not entirely dissimilar to the time stamps on
> multi-layered devices (e.g. DSA switches). The time stamp can either 
> be generated when the packet enters the device (current mlx5
> behavior)
> or when it actually egresses thru the MAC (what this set adds).
> 
> So while we could find other hardware like this if we squint hard
> enough
> - I'm not sure how much practical use for CPU-side stamps there is in
> DSA.
> 
> 
> My main concern is the user friendliness. I think there is no
> question
> that user running ptp4l would want this mlx5 knob to be enabled.
> Would
> we rather see a patch to ptp4l that turns per driver knob or should
> we
> shoot for some form of an API that tells the kernel that we're
> expecting ns level time accuracy? 
> 
> That's how I would phrase the dilemma here.

This is why I think that the new PTP TX flag, which lets the driver know
that only PTP EVENT messages are important, would be the perfect answer
for all of the above. This flag has a very standard definition, which
could also mean improved precision for PTP messages if the HW can do
it. Why not? ptp4l should always choose this flag if it is present, as
ptp4l shouldn't request PTP HW timestamps on all TX traffic as it is
doing today; that is just overkill.

The other option would be adding a new knob outside the scope of the PTP
APIs, which is going to be as ugly as a private flag.
Richard Cochran Dec. 8, 2020, 1:02 p.m. UTC | #25
On Mon, Dec 07, 2020 at 12:42:33PM -0800, Jakub Kicinski wrote:

> The behavior is not entirely dissimilar to the time stamps on
> multi-layered devices (e.g. DSA switches). The time stamp can either 
> be generated when the packet enters the device (current mlx5 behavior)
> or when it actually egresses thru the MAC (what this set adds).

To be useful, the time stamps must be taken on the external ports.
Generating the time stamp at the DMA reception in the device doesn't
even make sense, unless the delay through the device is constant.
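
The point about constant delay can be illustrated with a small sketch
(hypothetical helper, made-up sample values). A constant delay between
the stamping point and the wire can be subtracted out as a fixed
correction, whereas any spread in that delay shows up directly as
timestamp error.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Peak-to-peak spread, in ns, of per-packet delays between the point
 * where the stamp is taken and the external port. Zero means the delay
 * is constant and fully correctable; anything else is residual jitter.
 */
int64_t delay_peak_to_peak_ns(const int64_t *delay_ns, size_t n)
{
	int64_t lo, hi;
	size_t i;

	assert(delay_ns != NULL && n > 0);

	lo = hi = delay_ns[0];
	for (i = 1; i < n; i++) {
		if (delay_ns[i] < lo)
			lo = delay_ns[i];
		if (delay_ns[i] > hi)
			hi = delay_ns[i];
	}
	return hi - lo;
}
```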

> My main concern is the user friendliness. I think there is no question
> that user running ptp4l would want this mlx5 knob to be enabled.

Right.

> Would
> we rather see a patch to ptp4l that turns per driver knob or should we
> shoot for some form of an API that tells the kernel that we're
> expecting ns level time accuracy? 

This is a hardware-specific "feature".  One of the guiding principles
of the linuxptp user space stack is not to become a catalog of
workarounds for random hardware.  IMO the kernel's API should not
encourage "special" hardware either.  After all, we have lots and lots
of PTP hardware supported, all of them already working with the
existing API just fine.

My preference is for a global knob for users of this hardware, either

- a compile time Kconfig option on the driver, or
- some kind of sysctl/debugfs knob

Thanks,
Richard
Patch

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Makefile b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
index 83a67ca43a41..77961643d5a9 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/Makefile
+++ b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
@@ -25,7 +25,7 @@  mlx5_core-$(CONFIG_MLX5_CORE_EN) += en_main.o en_common.o en_fs.o en_ethtool.o \
 		en_tx.o en_rx.o en_dim.o en_txrx.o en/xdp.o en_stats.o \
 		en_selftest.o en/port.o en/monitor_stats.o en/health.o \
 		en/reporter_tx.o en/reporter_rx.o en/params.o en/xsk/pool.o \
-		en/xsk/setup.o en/xsk/rx.o en/xsk/tx.o en/devlink.o
+		en/xsk/setup.o en/xsk/rx.o en/xsk/tx.o en/devlink.o en/ptp.o
 
 #
 # Netdev extra
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 7f3bd3d406b3..6864c79d2d9a 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -227,6 +227,7 @@  enum mlx5e_priv_flag {
 	MLX5E_PFLAG_RX_NO_CSUM_COMPLETE,
 	MLX5E_PFLAG_XDP_TX_MPWQE,
 	MLX5E_PFLAG_SKB_TX_MPWQE,
+	MLX5E_PFLAG_TX_PORT_TS,
 	MLX5E_NUM_PFLAGS, /* Keep last */
 };
 
@@ -338,6 +339,8 @@  struct mlx5e_skb_fifo {
 	u16 mask;
 };
 
+struct mlx5e_ptpsq;
+
 struct mlx5e_txqsq {
 	/* data path */
 
@@ -385,6 +388,7 @@  struct mlx5e_txqsq {
 	int                        txq_ix;
 	u32                        rate_limit;
 	struct work_struct         recover_work;
+	struct mlx5e_ptpsq        *ptpsq;
 } ____cacheline_aligned_in_smp;
 
 struct mlx5e_dma_info {
@@ -692,8 +696,11 @@  struct mlx5e_channel {
 	int                        cpu;
 };
 
+struct mlx5e_port_ptp;
+
 struct mlx5e_channels {
 	struct mlx5e_channel **c;
+	struct mlx5e_port_ptp  *port_ptp;
 	unsigned int           num;
 	struct mlx5e_params    params;
 };
@@ -708,6 +715,11 @@  struct mlx5e_channel_stats {
 	struct mlx5e_xdpsq_stats xsksq;
 } ____cacheline_aligned_in_smp;
 
+struct mlx5e_port_ptp_stats {
+	struct mlx5e_ch_stats ch;
+	struct mlx5e_sq_stats sq[MLX5E_MAX_NUM_TC];
+} ____cacheline_aligned_in_smp;
+
 enum {
 	MLX5E_STATE_OPENED,
 	MLX5E_STATE_DESTROYING,
@@ -777,8 +789,10 @@  struct mlx5e_scratchpad {
 
 struct mlx5e_priv {
 	/* priv data path fields - start */
-	struct mlx5e_txqsq *txq2sq[MLX5E_MAX_NUM_CHANNELS * MLX5E_MAX_NUM_TC];
+	/* +1 for port ptp ts */
+	struct mlx5e_txqsq *txq2sq[(MLX5E_MAX_NUM_CHANNELS + 1) * MLX5E_MAX_NUM_TC];
 	int channel_tc2realtxq[MLX5E_MAX_NUM_CHANNELS][MLX5E_MAX_NUM_TC];
+	int port_ptp_tc2realtxq[MLX5E_MAX_NUM_TC];
 #ifdef CONFIG_MLX5_CORE_EN_DCB
 	struct mlx5e_dcbx_dp       dcbx_dp;
 #endif
@@ -813,12 +827,15 @@  struct mlx5e_priv {
 	struct net_device         *netdev;
 	struct mlx5e_stats         stats;
 	struct mlx5e_channel_stats channel_stats[MLX5E_MAX_NUM_CHANNELS];
+	struct mlx5e_port_ptp_stats port_ptp_stats;
 	u16                        max_nch;
 	u8                         max_opened_tc;
+	bool                       port_ptp_opened;
 	struct hwtstamp_config     tstamp;
 	u16                        q_counter;
 	u16                        drop_rq_q_counter;
 	struct notifier_block      events_nb;
+	int                        num_tc_x_num_ch;
 
 	struct udp_tunnel_nic_info nic_info;
 #ifdef CONFIG_MLX5_CORE_EN_DCB
@@ -993,7 +1010,17 @@  void mlx5e_deactivate_icosq(struct mlx5e_icosq *icosq);
 int mlx5e_modify_sq(struct mlx5_core_dev *mdev, u32 sqn,
 		    struct mlx5e_modify_sq_param *p);
 void mlx5e_activate_txqsq(struct mlx5e_txqsq *sq);
+void mlx5e_deactivate_txqsq(struct mlx5e_txqsq *sq);
+void mlx5e_free_txqsq(struct mlx5e_txqsq *sq);
 void mlx5e_tx_disable_queue(struct netdev_queue *txq);
+int mlx5e_alloc_txqsq_db(struct mlx5e_txqsq *sq, int numa);
+void mlx5e_free_txqsq_db(struct mlx5e_txqsq *sq);
+struct mlx5e_create_sq_param;
+int mlx5e_create_sq_rdy(struct mlx5_core_dev *mdev,
+			struct mlx5e_sq_param *param,
+			struct mlx5e_create_sq_param *csp,
+			u32 *sqn);
+void mlx5e_tx_err_cqe_work(struct work_struct *recover_work);
 
 static inline bool mlx5_tx_swp_supported(struct mlx5_core_dev *mdev)
 {
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/params.h b/drivers/net/ethernet/mellanox/mlx5/core/en/params.h
index 187007ad3349..70e463712b7f 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/params.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/params.h
@@ -41,6 +41,14 @@  struct mlx5e_channel_param {
 	struct mlx5e_sq_param      async_icosq;
 };
 
+struct mlx5e_create_sq_param {
+	struct mlx5_wq_ctrl        *wq_ctrl;
+	u32                         cqn;
+	u32                         tisn;
+	u8                          tis_lst_sz;
+	u8                          min_inline_mode;
+};
+
 static inline bool mlx5e_qid_get_ch_if_in_group(struct mlx5e_params *params,
 						u16 qid,
 						enum mlx5e_rq_group group,
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/ptp.c b/drivers/net/ethernet/mellanox/mlx5/core/en/ptp.c
new file mode 100644
index 000000000000..8639b5104df7
--- /dev/null
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/ptp.c
@@ -0,0 +1,360 @@ 
+// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
+// Copyright (c) 2020 Mellanox Technologies
+
+#include "en/ptp.h"
+#include "en/txrx.h"
+#include "lib/clock.h"
+
+static int mlx5e_ptp_napi_poll(struct napi_struct *napi, int budget)
+{
+	struct mlx5e_port_ptp *c = container_of(napi, struct mlx5e_port_ptp,
+						napi);
+	struct mlx5e_ch_stats *ch_stats = c->stats;
+	bool busy = false;
+	int work_done = 0;
+	int i;
+
+	rcu_read_lock();
+
+	ch_stats->poll++;
+
+	for (i = 0; i < c->num_tc; i++)
+		busy |= mlx5e_poll_tx_cq(&c->ptpsq[i].txqsq.cq, budget);
+
+	if (busy) {
+		work_done = budget;
+		goto out;
+	}
+
+	if (unlikely(!napi_complete_done(napi, work_done)))
+		goto out;
+
+	ch_stats->arm++;
+
+	for (i = 0; i < c->num_tc; i++)
+		mlx5e_cq_arm(&c->ptpsq[i].txqsq.cq);
+
+out:
+	rcu_read_unlock();
+
+	return work_done;
+}
+
+static int mlx5e_ptp_alloc_txqsq(struct mlx5e_port_ptp *c, int txq_ix,
+				 struct mlx5e_params *params,
+				 struct mlx5e_sq_param *param,
+				 struct mlx5e_txqsq *sq, int tc,
+				 struct mlx5e_ptpsq *ptpsq)
+{
+	void *sqc_wq               = MLX5_ADDR_OF(sqc, param->sqc, wq);
+	struct mlx5_core_dev *mdev = c->mdev;
+	struct mlx5_wq_cyc *wq = &sq->wq;
+	int err;
+	int node;
+
+	sq->pdev      = c->pdev;
+	sq->tstamp    = c->tstamp;
+	sq->clock     = &mdev->clock;
+	sq->mkey_be   = c->mkey_be;
+	sq->netdev    = c->netdev;
+	sq->priv      = c->priv;
+	sq->mdev      = mdev;
+	sq->ch_ix     = c->ix;
+	sq->txq_ix    = txq_ix;
+	sq->uar_map   = mdev->mlx5e_res.bfreg.map;
+	sq->min_inline_mode = params->tx_min_inline_mode;
+	sq->hw_mtu    = MLX5E_SW2HW_MTU(params, params->sw_mtu);
+	sq->stats     = &c->priv->port_ptp_stats.sq[tc];
+	sq->ptpsq     = ptpsq;
+	INIT_WORK(&sq->recover_work, mlx5e_tx_err_cqe_work);
+	if (!MLX5_CAP_ETH(mdev, wqe_vlan_insert))
+		set_bit(MLX5E_SQ_STATE_VLAN_NEED_L2_INLINE, &sq->state);
+	sq->stop_room = param->stop_room;
+
+	node = dev_to_node(mlx5_core_dma_dev(mdev));
+
+	param->wq.db_numa_node = node;
+	err = mlx5_wq_cyc_create(mdev, &param->wq, sqc_wq, wq, &sq->wq_ctrl);
+	if (err)
+		return err;
+	wq->db    = &wq->db[MLX5_SND_DBR];
+
+	err = mlx5e_alloc_txqsq_db(sq, node);
+	if (err)
+		goto err_sq_wq_destroy;
+
+	return 0;
+
+err_sq_wq_destroy:
+	mlx5_wq_destroy(&sq->wq_ctrl);
+
+	return err;
+}
+
+static void mlx5e_ptp_destroy_sq(struct mlx5_core_dev *mdev, u32 sqn)
+{
+	mlx5_core_destroy_sq(mdev, sqn);
+}
+
+static int mlx5e_ptp_open_txqsq(struct mlx5e_port_ptp *c, u32 tisn,
+				int txq_ix, struct mlx5e_ptp_params *cparams,
+				int tc, struct mlx5e_ptpsq *ptpsq)
+{
+	struct mlx5e_sq_param *sqp = &cparams->txq_sq_param;
+	struct mlx5e_txqsq *txqsq = &ptpsq->txqsq;
+	struct mlx5e_create_sq_param csp = {};
+	int err;
+
+	err = mlx5e_ptp_alloc_txqsq(c, txq_ix, &cparams->params, sqp,
+				    txqsq, tc, ptpsq);
+	if (err)
+		return err;
+
+	csp.tisn            = tisn;
+	csp.tis_lst_sz      = 1;
+	csp.cqn             = txqsq->cq.mcq.cqn;
+	csp.wq_ctrl         = &txqsq->wq_ctrl;
+	csp.min_inline_mode = txqsq->min_inline_mode;
+
+	err = mlx5e_create_sq_rdy(c->mdev, sqp, &csp, &txqsq->sqn);
+	if (err)
+		goto err_free_txqsq;
+
+	return 0;
+
+err_free_txqsq:
+	mlx5e_free_txqsq(txqsq);
+
+	return err;
+}
+
+static void mlx5e_ptp_close_txqsq(struct mlx5e_ptpsq *ptpsq)
+{
+	struct mlx5e_txqsq *sq = &ptpsq->txqsq;
+	struct mlx5_core_dev *mdev = sq->mdev;
+
+	cancel_work_sync(&sq->recover_work);
+	mlx5e_ptp_destroy_sq(mdev, sq->sqn);
+	mlx5e_free_txqsq_descs(sq);
+	mlx5e_free_txqsq(sq);
+}
+
+static int mlx5e_ptp_open_txqsqs(struct mlx5e_port_ptp *c,
+				 struct mlx5e_ptp_params *cparams)
+{
+	struct mlx5e_params *params = &cparams->params;
+	int ix_base;
+	int err;
+	int tc;
+
+	ix_base = params->num_tc * params->num_channels;
+
+	for (tc = 0; tc < params->num_tc; tc++) {
+		int txq_ix = ix_base + tc;
+
+		err = mlx5e_ptp_open_txqsq(c, c->priv->tisn[c->lag_port][tc], txq_ix,
+					   cparams, tc, &c->ptpsq[tc]);
+		if (err)
+			goto close_txqsq;
+	}
+
+	return 0;
+
+close_txqsq:
+	for (--tc; tc >= 0; tc--)
+		mlx5e_ptp_close_txqsq(&c->ptpsq[tc]);
+
+	return err;
+}
+
+static void mlx5e_ptp_close_txqsqs(struct mlx5e_port_ptp *c)
+{
+	int tc;
+
+	for (tc = 0; tc < c->num_tc; tc++)
+		mlx5e_ptp_close_txqsq(&c->ptpsq[tc]);
+}
+
+static int mlx5e_ptp_open_cqs(struct mlx5e_port_ptp *c,
+			      struct mlx5e_ptp_params *cparams)
+{
+	struct mlx5e_params *params = &cparams->params;
+	struct mlx5e_create_cq_param ccp = {};
+	struct dim_cq_moder ptp_moder = {};
+	struct mlx5e_cq_param *cq_param;
+	int err;
+	int tc;
+
+	ccp.node     = dev_to_node(mlx5_core_dma_dev(c->mdev));
+	ccp.ch_stats = c->stats;
+	ccp.napi     = &c->napi;
+	ccp.ix       = c->ix;
+
+	cq_param = &cparams->txq_sq_param.cqp;
+
+	for (tc = 0; tc < params->num_tc; tc++) {
+		struct mlx5e_cq *cq = &c->ptpsq[tc].txqsq.cq;
+
+		err = mlx5e_open_cq(c->priv, ptp_moder, cq_param, &ccp, cq);
+		if (err)
+			goto out_err_txqsq_cq;
+	}
+
+	return 0;
+
+out_err_txqsq_cq:
+	for (--tc; tc >= 0; tc--)
+		mlx5e_close_cq(&c->ptpsq[tc].txqsq.cq);
+
+	return err;
+}
+
+static void mlx5e_ptp_close_cqs(struct mlx5e_port_ptp *c)
+{
+	int tc;
+
+	for (tc = 0; tc < c->num_tc; tc++)
+		mlx5e_close_cq(&c->ptpsq[tc].txqsq.cq);
+}
+
+static void mlx5e_ptp_build_sq_param(struct mlx5e_priv *priv,
+				     struct mlx5e_params *params,
+				     struct mlx5e_sq_param *param)
+{
+	void *sqc = param->sqc;
+	void *wq;
+
+	mlx5e_build_sq_param_common(priv, param);
+
+	wq = MLX5_ADDR_OF(sqc, sqc, wq);
+	MLX5_SET(wq, wq, log_wq_sz, params->log_sq_size);
+	param->stop_room = mlx5e_stop_room_for_wqe(MLX5_SEND_WQE_MAX_WQEBBS);
+	mlx5e_build_tx_cq_param(priv, params, &param->cqp);
+}
+
+static void mlx5e_ptp_build_params(struct mlx5e_port_ptp *c,
+				   struct mlx5e_ptp_params *cparams,
+				   struct mlx5e_params *orig)
+{
+	struct mlx5e_params *params = &cparams->params;
+
+	params->tx_min_inline_mode = orig->tx_min_inline_mode;
+	params->num_channels = orig->num_channels;
+	params->hard_mtu = orig->hard_mtu;
+	params->sw_mtu = orig->sw_mtu;
+	params->num_tc = orig->num_tc;
+
+	/* SQ */
+	params->log_sq_size = orig->log_sq_size;
+
+	mlx5e_ptp_build_sq_param(c->priv, params, &cparams->txq_sq_param);
+}
+
+static int mlx5e_ptp_open_queues(struct mlx5e_port_ptp *c,
+				 struct mlx5e_ptp_params *cparams)
+{
+	int err;
+
+	err = mlx5e_ptp_open_cqs(c, cparams);
+	if (err)
+		return err;
+
+	napi_enable(&c->napi);
+
+	err = mlx5e_ptp_open_txqsqs(c, cparams);
+	if (err)
+		goto disable_napi;
+
+	return 0;
+
+disable_napi:
+	napi_disable(&c->napi);
+	mlx5e_ptp_close_cqs(c);
+
+	return err;
+}
+
+static void mlx5e_ptp_close_queues(struct mlx5e_port_ptp *c)
+{
+	mlx5e_ptp_close_txqsqs(c);
+	napi_disable(&c->napi);
+	mlx5e_ptp_close_cqs(c);
+}
+
+int mlx5e_port_ptp_open(struct mlx5e_priv *priv, struct mlx5e_params *params,
+			u8 lag_port, struct mlx5e_port_ptp **cp)
+{
+	struct net_device *netdev = priv->netdev;
+	struct mlx5_core_dev *mdev = priv->mdev;
+	struct mlx5e_ptp_params *cparams;
+	struct mlx5e_port_ptp *c;
+	unsigned int irq;
+	int err;
+	int eqn;
+
+	err = mlx5_vector2eqn(priv->mdev, 0, &eqn, &irq);
+	if (err)
+		return err;
+
+	c = kvzalloc_node(sizeof(*c), GFP_KERNEL, dev_to_node(mlx5_core_dma_dev(mdev)));
+	cparams = kvzalloc(sizeof(*cparams), GFP_KERNEL);
+	if (!c || !cparams)
+		return -ENOMEM;
+
+	c->priv     = priv;
+	c->mdev     = priv->mdev;
+	c->tstamp   = &priv->tstamp;
+	c->ix       = 0;
+	c->pdev     = mlx5_core_dma_dev(priv->mdev);
+	c->netdev   = priv->netdev;
+	c->mkey_be  = cpu_to_be32(priv->mdev->mlx5e_res.mkey.key);
+	c->num_tc   = params->num_tc;
+	c->stats    = &priv->port_ptp_stats.ch;
+	c->irq_desc = irq_to_desc(irq);
+	c->lag_port = lag_port;
+
+	netif_napi_add(netdev, &c->napi, mlx5e_ptp_napi_poll, 64);
+
+	mlx5e_ptp_build_params(c, cparams, params);
+
+	err = mlx5e_ptp_open_queues(c, cparams);
+	if (unlikely(err))
+		goto err_napi_del;
+
+	*cp = c;
+
+	kvfree(cparams);
+
+	return 0;
+
+err_napi_del:
+	netif_napi_del(&c->napi);
+
+	kvfree(cparams);
+	kvfree(c);
+	return err;
+}
+
+void mlx5e_port_ptp_close(struct mlx5e_port_ptp *c)
+{
+	mlx5e_ptp_close_queues(c);
+	netif_napi_del(&c->napi);
+
+	kvfree(c);
+}
+
+void mlx5e_ptp_activate_channel(struct mlx5e_port_ptp *c)
+{
+	int tc;
+
+	for (tc = 0; tc < c->num_tc; tc++)
+		mlx5e_activate_txqsq(&c->ptpsq[tc].txqsq);
+}
+
+void mlx5e_ptp_deactivate_channel(struct mlx5e_port_ptp *c)
+{
+	int tc;
+
+	for (tc = 0; tc < c->num_tc; tc++)
+		mlx5e_deactivate_txqsq(&c->ptpsq[tc].txqsq);
+}
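
Note on the allocations at the top of mlx5e_port_ptp_open() above: c and cparams are allocated back to back, and a failure of either one returns -ENOMEM directly, so when only one of the two fails the other appears to be leaked. Below is a minimal user-space sketch of a single cleanup path that covers both cases. It is an illustration under stated assumptions, not the driver code: port_open(), struct port, struct params and the fail_second_alloc switch are hypothetical names, and calloc()/free() stand in for kvzalloc()/kvfree(). Like kvfree(NULL), free(NULL) is a no-op, which is what lets one label serve both failures.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-ins for struct mlx5e_port_ptp and struct mlx5e_ptp_params. */
struct port { int num_tc; };
struct params { int num_tc; };

/* Set to 1 to simulate the second allocation failing (illustration only). */
static int fail_second_alloc;

static void *alloc_params(void)
{
	return fail_second_alloc ? NULL : calloc(1, sizeof(struct params));
}

static int port_open(int num_tc, struct port **out)
{
	struct params *p;
	struct port *c;
	int err;

	c = calloc(1, sizeof(*c));
	p = alloc_params();
	if (!c || !p) {
		err = -ENOMEM;
		goto err_free;	/* frees whichever allocation succeeded */
	}

	p->num_tc = num_tc;
	c->num_tc = p->num_tc;
	*out = c;
	free(p);		/* params are only needed while opening */
	return 0;

err_free:
	free(p);
	free(c);
	return err;
}
```

In the patch itself the equivalent change would be to jump to a label placed just before the existing kvfree(cparams)/kvfree(c) pair instead of returning early.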
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/ptp.h b/drivers/net/ethernet/mellanox/mlx5/core/en/ptp.h
new file mode 100644
index 000000000000..daa3b6953e3f
--- /dev/null
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/ptp.h
@@ -0,0 +1,48 @@ 
+/* SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB */
+/* Copyright (c) 2020 Mellanox Technologies. */
+
+#ifndef __MLX5_EN_PTP_H__
+#define __MLX5_EN_PTP_H__
+
+#include "en.h"
+#include "en/params.h"
+#include "en_stats.h"
+
+struct mlx5e_ptpsq {
+	struct mlx5e_txqsq       txqsq;
+};
+
+struct mlx5e_port_ptp {
+	/* data path */
+	struct mlx5e_ptpsq         ptpsq[MLX5E_MAX_NUM_TC];
+	struct napi_struct         napi;
+	struct device             *pdev;
+	struct net_device         *netdev;
+	__be32                     mkey_be;
+	u8                         num_tc;
+	u8                         lag_port;
+
+	/* data path - accessed per napi poll */
+	struct irq_desc *irq_desc;
+	struct mlx5e_ch_stats     *stats;
+
+	/* control */
+	struct mlx5e_priv         *priv;
+	struct mlx5_core_dev      *mdev;
+	struct hwtstamp_config    *tstamp;
+	DECLARE_BITMAP(state, MLX5E_CHANNEL_NUM_STATES);
+	int                        ix;
+};
+
+struct mlx5e_ptp_params {
+	struct mlx5e_params        params;
+	struct mlx5e_sq_param      txq_sq_param;
+};
+
+int mlx5e_port_ptp_open(struct mlx5e_priv *priv, struct mlx5e_params *params,
+			u8 lag_port, struct mlx5e_port_ptp **cp);
+void mlx5e_port_ptp_close(struct mlx5e_port_ptp *c);
+void mlx5e_ptp_activate_channel(struct mlx5e_port_ptp *c);
+void mlx5e_ptp_deactivate_channel(struct mlx5e_port_ptp *c);
+
+#endif /* __MLX5_EN_PTP_H__ */
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/reporter_tx.c b/drivers/net/ethernet/mellanox/mlx5/core/en/reporter_tx.c
index 88b3b21d1068..c55a2ad10599 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/reporter_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/reporter_tx.c
@@ -2,6 +2,7 @@ 
 /* Copyright (c) 2019 Mellanox Technologies. */
 
 #include "health.h"
+#include "en/ptp.h"
 
 static int mlx5e_wait_for_sq_flush(struct mlx5e_txqsq *sq)
 {
@@ -141,8 +142,8 @@  static int mlx5e_tx_reporter_recover(struct devlink_health_reporter *reporter,
 }
 
 static int
-mlx5e_tx_reporter_build_diagnose_output(struct devlink_fmsg *fmsg,
-					struct mlx5e_txqsq *sq, int tc)
+mlx5e_tx_reporter_build_diagnose_output_sq_common(struct devlink_fmsg *fmsg,
+						  struct mlx5e_txqsq *sq, int tc)
 {
 	bool stopped = netif_xmit_stopped(sq->txq);
 	struct mlx5e_priv *priv = sq->priv;
@@ -153,14 +154,6 @@  mlx5e_tx_reporter_build_diagnose_output(struct devlink_fmsg *fmsg,
 	if (err)
 		return err;
 
-	err = devlink_fmsg_obj_nest_start(fmsg);
-	if (err)
-		return err;
-
-	err = devlink_fmsg_u32_pair_put(fmsg, "channel ix", sq->ch_ix);
-	if (err)
-		return err;
-
 	err = devlink_fmsg_u32_pair_put(fmsg, "tc", tc);
 	if (err)
 		return err;
@@ -193,7 +186,24 @@  mlx5e_tx_reporter_build_diagnose_output(struct devlink_fmsg *fmsg,
 	if (err)
 		return err;
 
-	err = mlx5e_health_eq_diag_fmsg(sq->cq.mcq.eq, fmsg);
+	return mlx5e_health_eq_diag_fmsg(sq->cq.mcq.eq, fmsg);
+}
+
+static int
+mlx5e_tx_reporter_build_diagnose_output(struct devlink_fmsg *fmsg,
+					struct mlx5e_txqsq *sq, int tc)
+{
+	int err;
+
+	err = devlink_fmsg_obj_nest_start(fmsg);
+	if (err)
+		return err;
+
+	err = devlink_fmsg_u32_pair_put(fmsg, "channel ix", sq->ch_ix);
+	if (err)
+		return err;
+
+	err = mlx5e_tx_reporter_build_diagnose_output_sq_common(fmsg, sq, tc);
 	if (err)
 		return err;
 
@@ -204,49 +214,116 @@  mlx5e_tx_reporter_build_diagnose_output(struct devlink_fmsg *fmsg,
 	return 0;
 }
 
-static int mlx5e_tx_reporter_diagnose(struct devlink_health_reporter *reporter,
-				      struct devlink_fmsg *fmsg,
-				      struct netlink_ext_ack *extack)
+static int
+mlx5e_tx_reporter_build_diagnose_output_ptpsq(struct devlink_fmsg *fmsg,
+					      struct mlx5e_ptpsq *ptpsq, int tc)
 {
-	struct mlx5e_priv *priv = devlink_health_reporter_priv(reporter);
-	struct mlx5e_txqsq *generic_sq = priv->txq2sq[0];
-	u32 sq_stride, sq_sz;
-
-	int i, tc, err = 0;
+	int err;
 
-	mutex_lock(&priv->state_lock);
+	err = devlink_fmsg_obj_nest_start(fmsg);
+	if (err)
+		return err;
 
-	if (!test_bit(MLX5E_STATE_OPENED, &priv->state))
-		goto unlock;
+	err = devlink_fmsg_string_pair_put(fmsg, "channel", "ptp");
+	if (err)
+		return err;
 
-	sq_sz = mlx5_wq_cyc_get_size(&generic_sq->wq);
-	sq_stride = MLX5_SEND_WQE_BB;
+	err = mlx5e_tx_reporter_build_diagnose_output_sq_common(fmsg,
+								&ptpsq->txqsq,
+								tc);
+	if (err)
+		return err;
 
-	err = mlx5e_health_fmsg_named_obj_nest_start(fmsg, "Common Config");
+	err = devlink_fmsg_obj_nest_end(fmsg);
 	if (err)
-		goto unlock;
+		return err;
+
+	return 0;
+}
+
+static int
+mlx5e_tx_reporter_diagnose_generic_txqsq(struct devlink_fmsg *fmsg,
+					 struct mlx5e_txqsq *txqsq)
+{
+	u32 sq_stride, sq_sz;
+	int err;
 
 	err = mlx5e_health_fmsg_named_obj_nest_start(fmsg, "SQ");
 	if (err)
-		goto unlock;
+		return err;
+
+	sq_sz = mlx5_wq_cyc_get_size(&txqsq->wq);
+	sq_stride = MLX5_SEND_WQE_BB;
 
 	err = devlink_fmsg_u64_pair_put(fmsg, "stride size", sq_stride);
 	if (err)
-		goto unlock;
+		return err;
 
 	err = devlink_fmsg_u32_pair_put(fmsg, "size", sq_sz);
 	if (err)
-		goto unlock;
+		return err;
 
-	err = mlx5e_health_cq_common_diag_fmsg(&generic_sq->cq, fmsg);
+	err = mlx5e_health_cq_common_diag_fmsg(&txqsq->cq, fmsg);
 	if (err)
-		goto unlock;
+		return err;
+
+	return mlx5e_health_fmsg_named_obj_nest_end(fmsg);
+}
+
+static int
+mlx5e_tx_reporter_diagnose_common_config(struct devlink_health_reporter *reporter,
+					 struct devlink_fmsg *fmsg)
+{
+	struct mlx5e_priv *priv = devlink_health_reporter_priv(reporter);
+	struct mlx5e_txqsq *generic_sq = priv->txq2sq[0];
+	struct mlx5e_ptpsq *generic_ptpsq;
+	int err;
+
+	err = mlx5e_health_fmsg_named_obj_nest_start(fmsg, "Common Config");
+	if (err)
+		return err;
+
+	err = mlx5e_tx_reporter_diagnose_generic_txqsq(fmsg, generic_sq);
+	if (err)
+		return err;
+
+	generic_ptpsq = priv->channels.port_ptp ?
+			&priv->channels.port_ptp->ptpsq[0] :
+			NULL;
+	if (!generic_ptpsq)
+		goto out;
+
+	err = mlx5e_health_fmsg_named_obj_nest_start(fmsg, "PTP");
+	if (err)
+		return err;
+
+	err = mlx5e_tx_reporter_diagnose_generic_txqsq(fmsg, &generic_ptpsq->txqsq);
+	if (err)
+		return err;
 
 	err = mlx5e_health_fmsg_named_obj_nest_end(fmsg);
 	if (err)
+		return err;
+
+out:
+	return mlx5e_health_fmsg_named_obj_nest_end(fmsg);
+}
+
+static int mlx5e_tx_reporter_diagnose(struct devlink_health_reporter *reporter,
+				      struct devlink_fmsg *fmsg,
+				      struct netlink_ext_ack *extack)
+{
+	struct mlx5e_priv *priv = devlink_health_reporter_priv(reporter);
+	struct mlx5e_port_ptp *ptp_ch = priv->channels.port_ptp;
+
+	int i, tc, err = 0;
+
+	mutex_lock(&priv->state_lock);
+
+	if (!test_bit(MLX5E_STATE_OPENED, &priv->state))
 		goto unlock;
 
-	err = mlx5e_health_fmsg_named_obj_nest_end(fmsg);
+	err = mlx5e_tx_reporter_diagnose_common_config(reporter, fmsg);
 	if (err)
 		goto unlock;
 
@@ -265,6 +342,19 @@  static int mlx5e_tx_reporter_diagnose(struct devlink_health_reporter *reporter,
 				goto unlock;
 		}
 	}
+
+	if (!ptp_ch)
+		goto close_sqs_nest;
+
+	for (tc = 0; tc < priv->channels.params.num_tc; tc++) {
+		err = mlx5e_tx_reporter_build_diagnose_output_ptpsq(fmsg,
+								    &ptp_ch->ptpsq[tc],
+								    tc);
+		if (err)
+			goto unlock;
+	}
+
+close_sqs_nest:
 	err = devlink_fmsg_arr_pair_nest_end(fmsg);
 	if (err)
 		goto unlock;
@@ -338,6 +428,7 @@  static int mlx5e_tx_reporter_dump_sq(struct mlx5e_priv *priv, struct devlink_fms
 static int mlx5e_tx_reporter_dump_all_sqs(struct mlx5e_priv *priv,
 					  struct devlink_fmsg *fmsg)
 {
+	struct mlx5e_port_ptp *ptp_ch = priv->channels.port_ptp;
 	struct mlx5_rsc_key key = {};
 	int i, tc, err;
 
@@ -373,6 +464,17 @@  static int mlx5e_tx_reporter_dump_all_sqs(struct mlx5e_priv *priv,
 				return err;
 		}
 	}
+
+	if (ptp_ch) {
+		for (tc = 0; tc < priv->channels.params.num_tc; tc++) {
+			struct mlx5e_txqsq *sq = &ptp_ch->ptpsq[tc].txqsq;
+
+			err = mlx5e_health_queue_dump(priv, fmsg, sq->sqn, "PTP SQ");
+			if (err)
+				return err;
+		}
+	}
+
 	return devlink_fmsg_arr_pair_nest_end(fmsg);
 }
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c b/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
index 42e61dc28ead..30542d98ab27 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
@@ -1946,6 +1946,38 @@  static int set_pflag_skb_tx_mpwqe(struct net_device *netdev, bool enable)
 	return set_pflag_tx_mpwqe_common(netdev, MLX5E_PFLAG_SKB_TX_MPWQE, enable);
 }
 
+static int set_pflag_tx_port_ts(struct net_device *netdev, bool enable)
+{
+	struct mlx5e_priv *priv = netdev_priv(netdev);
+	struct mlx5_core_dev *mdev = priv->mdev;
+	struct mlx5e_channels new_channels = {};
+	int err;
+
+	if (!MLX5_CAP_GEN(mdev, ts_cqe_to_dest_cqn))
+		return -EOPNOTSUPP;
+
+	new_channels.params = priv->channels.params;
+	MLX5E_SET_PFLAG(&new_channels.params, MLX5E_PFLAG_TX_PORT_TS, enable);
+	/* No need to verify SQ stop room as
+	 * ptpsq.txqsq.stop_room <= generic_sq->stop_room, and both
+	 * have the same log_sq_size.
+	 */
+
+	if (!test_bit(MLX5E_STATE_OPENED, &priv->state)) {
+		priv->channels.params = new_channels.params;
+		err = mlx5e_num_channels_changed(priv);
+		goto out;
+	}
+
+	err = mlx5e_safe_switch_channels(priv, &new_channels,
+					 mlx5e_num_channels_changed_ctx, NULL);
+out:
+	if (!err)
+		priv->port_ptp_opened = true;
+
+	return err;
+}
+
 static const struct pflag_desc mlx5e_priv_flags[MLX5E_NUM_PFLAGS] = {
 	{ "rx_cqe_moder",        set_pflag_rx_cqe_based_moder },
 	{ "tx_cqe_moder",        set_pflag_tx_cqe_based_moder },
@@ -1954,6 +1986,7 @@  static const struct pflag_desc mlx5e_priv_flags[MLX5E_NUM_PFLAGS] = {
 	{ "rx_no_csum_complete", set_pflag_rx_no_csum_complete },
 	{ "xdp_tx_mpwqe",        set_pflag_xdp_tx_mpwqe },
 	{ "skb_tx_mpwqe",        set_pflag_skb_tx_mpwqe },
+	{ "tx_port_ts",          set_pflag_tx_port_ts },
 };
 
 static int mlx5e_handle_pflag(struct net_device *netdev,
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 3ea15d62acd9..e36a13238271 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -64,6 +64,7 @@ 
 #include "en/hv_vhca_stats.h"
 #include "en/devlink.h"
 #include "lib/mlx5.h"
+#include "en/ptp.h"
 
 bool mlx5e_check_fragmented_striding_rq_cap(struct mlx5_core_dev *mdev)
 {
@@ -1083,14 +1084,14 @@  static void mlx5e_free_icosq(struct mlx5e_icosq *sq)
 	mlx5_wq_destroy(&sq->wq_ctrl);
 }
 
-static void mlx5e_free_txqsq_db(struct mlx5e_txqsq *sq)
+void mlx5e_free_txqsq_db(struct mlx5e_txqsq *sq)
 {
 	kvfree(sq->db.wqe_info);
 	kvfree(sq->db.skb_fifo.fifo);
 	kvfree(sq->db.dma_fifo);
 }
 
-static int mlx5e_alloc_txqsq_db(struct mlx5e_txqsq *sq, int numa)
+int mlx5e_alloc_txqsq_db(struct mlx5e_txqsq *sq, int numa)
 {
 	int wq_sz = mlx5_wq_cyc_get_size(&sq->wq);
 	int df_sz = wq_sz * MLX5_SEND_WQEBB_NUM_DS;
@@ -1118,7 +1119,6 @@  static int mlx5e_alloc_txqsq_db(struct mlx5e_txqsq *sq, int numa)
 	return 0;
 }
 
-static void mlx5e_tx_err_cqe_work(struct work_struct *recover_work);
 static int mlx5e_alloc_txqsq(struct mlx5e_channel *c,
 			     int txq_ix,
 			     struct mlx5e_params *params,
@@ -1176,20 +1176,12 @@  static int mlx5e_alloc_txqsq(struct mlx5e_channel *c,
 	return err;
 }
 
-static void mlx5e_free_txqsq(struct mlx5e_txqsq *sq)
+void mlx5e_free_txqsq(struct mlx5e_txqsq *sq)
 {
 	mlx5e_free_txqsq_db(sq);
 	mlx5_wq_destroy(&sq->wq_ctrl);
 }
 
-struct mlx5e_create_sq_param {
-	struct mlx5_wq_ctrl        *wq_ctrl;
-	u32                         cqn;
-	u32                         tisn;
-	u8                          tis_lst_sz;
-	u8                          min_inline_mode;
-};
-
 static int mlx5e_create_sq(struct mlx5_core_dev *mdev,
 			   struct mlx5e_sq_param *param,
 			   struct mlx5e_create_sq_param *csp,
@@ -1271,10 +1263,10 @@  static void mlx5e_destroy_sq(struct mlx5_core_dev *mdev, u32 sqn)
 	mlx5_core_destroy_sq(mdev, sqn);
 }
 
-static int mlx5e_create_sq_rdy(struct mlx5_core_dev *mdev,
-			       struct mlx5e_sq_param *param,
-			       struct mlx5e_create_sq_param *csp,
-			       u32 *sqn)
+int mlx5e_create_sq_rdy(struct mlx5_core_dev *mdev,
+			struct mlx5e_sq_param *param,
+			struct mlx5e_create_sq_param *csp,
+			u32 *sqn)
 {
 	struct mlx5e_modify_sq_param msp = {0};
 	int err;
@@ -1350,7 +1342,7 @@  void mlx5e_tx_disable_queue(struct netdev_queue *txq)
 	__netif_tx_unlock_bh(txq);
 }
 
-static void mlx5e_deactivate_txqsq(struct mlx5e_txqsq *sq)
+void mlx5e_deactivate_txqsq(struct mlx5e_txqsq *sq)
 {
 	struct mlx5_wq_cyc *wq = &sq->wq;
 
@@ -1389,7 +1381,7 @@  static void mlx5e_close_txqsq(struct mlx5e_txqsq *sq)
 	mlx5e_free_txqsq(sq);
 }
 
-static void mlx5e_tx_err_cqe_work(struct work_struct *recover_work)
+void mlx5e_tx_err_cqe_work(struct work_struct *recover_work)
 {
 	struct mlx5e_txqsq *sq = container_of(recover_work, struct mlx5e_txqsq,
 					      recover_work);
@@ -2374,6 +2366,13 @@  int mlx5e_open_channels(struct mlx5e_priv *priv,
 			goto err_close_channels;
 	}
 
+	if (MLX5E_GET_PFLAG(&chs->params, MLX5E_PFLAG_TX_PORT_TS)) {
+		err = mlx5e_port_ptp_open(priv, &chs->params, chs->c[0]->lag_port,
+					  &chs->port_ptp);
+		if (err)
+			goto err_close_channels;
+	}
+
 	mlx5e_health_channels_update(priv);
 	kvfree(cparam);
 	return 0;
@@ -2395,6 +2394,9 @@  static void mlx5e_activate_channels(struct mlx5e_channels *chs)
 
 	for (i = 0; i < chs->num; i++)
 		mlx5e_activate_channel(chs->c[i]);
+
+	if (chs->port_ptp)
+		mlx5e_ptp_activate_channel(chs->port_ptp);
 }
 
 #define MLX5E_RQ_WQES_TIMEOUT 20000 /* msecs */
@@ -2421,6 +2423,9 @@  static void mlx5e_deactivate_channels(struct mlx5e_channels *chs)
 {
 	int i;
 
+	if (chs->port_ptp)
+		mlx5e_ptp_deactivate_channel(chs->port_ptp);
+
 	for (i = 0; i < chs->num; i++)
 		mlx5e_deactivate_channel(chs->c[i]);
 }
@@ -2429,6 +2434,9 @@  void mlx5e_close_channels(struct mlx5e_channels *chs)
 {
 	int i;
 
+	if (chs->port_ptp)
+		mlx5e_port_ptp_close(chs->port_ptp);
+
 	for (i = 0; i < chs->num; i++)
 		mlx5e_close_channel(chs->c[i]);
 
@@ -2914,6 +2922,8 @@  static int mlx5e_update_netdev_queues(struct mlx5e_priv *priv)
 	nch = priv->channels.params.num_channels;
 	ntc = priv->channels.params.num_tc;
 	num_txqs = nch * ntc;
+	if (MLX5E_GET_PFLAG(&priv->channels.params, MLX5E_PFLAG_TX_PORT_TS))
+		num_txqs += ntc;
 	num_rxqs = nch * priv->profile->rq_groups;
 
 	mlx5e_netdev_set_tcs(netdev, nch, ntc);
@@ -2987,14 +2997,13 @@  MLX5E_DEFINE_PREACTIVATE_WRAPPER_CTX(mlx5e_num_channels_changed);
 
 static void mlx5e_build_txq_maps(struct mlx5e_priv *priv)
 {
-	int i, ch;
+	int i, ch, tc, num_tc;
 
 	ch = priv->channels.num;
+	num_tc = priv->channels.params.num_tc;
 
 	for (i = 0; i < ch; i++) {
-		int tc;
-
-		for (tc = 0; tc < priv->channels.params.num_tc; tc++) {
+		for (tc = 0; tc < num_tc; tc++) {
 			struct mlx5e_channel *c = priv->channels.c[i];
 			struct mlx5e_txqsq *sq = &c->sq[tc];
 
@@ -3002,10 +3011,28 @@  static void mlx5e_build_txq_maps(struct mlx5e_priv *priv)
 			priv->channel_tc2realtxq[i][tc] = i + tc * ch;
 		}
 	}
+
+	if (!priv->channels.port_ptp)
+		return;
+
+	for (tc = 0; tc < num_tc; tc++) {
+		struct mlx5e_port_ptp *c = priv->channels.port_ptp;
+		struct mlx5e_txqsq *sq = &c->ptpsq[tc].txqsq;
+
+		priv->txq2sq[sq->txq_ix] = sq;
+		priv->port_ptp_tc2realtxq[tc] = priv->num_tc_x_num_ch + tc;
+	}
+}
+
+static void mlx5e_update_num_tc_x_num_ch(struct mlx5e_priv *priv)
+{
+	priv->num_tc_x_num_ch = priv->channels.params.num_tc *
+				priv->channels.num;
 }
 
 void mlx5e_activate_priv_channels(struct mlx5e_priv *priv)
 {
+	mlx5e_update_num_tc_x_num_ch(priv);
 	mlx5e_build_txq_maps(priv);
 	mlx5e_activate_channels(&priv->channels);
 	mlx5e_xdp_tx_enable(priv);
@@ -4342,6 +4369,7 @@  static void mlx5e_tx_timeout_work(struct work_struct *work)
 {
 	struct mlx5e_priv *priv = container_of(work, struct mlx5e_priv,
 					       tx_timeout_work);
+	struct net_device *netdev = priv->netdev;
 	int i;
 
 	rtnl_lock();
@@ -4350,9 +4378,9 @@  static void mlx5e_tx_timeout_work(struct work_struct *work)
 	if (!test_bit(MLX5E_STATE_OPENED, &priv->state))
 		goto unlock;
 
-	for (i = 0; i < priv->channels.num * priv->channels.params.num_tc; i++) {
+	for (i = 0; i < netdev->real_num_tx_queues; i++) {
 		struct netdev_queue *dev_queue =
-			netdev_get_tx_queue(priv->netdev, i);
+			netdev_get_tx_queue(netdev, i);
 		struct mlx5e_txqsq *sq = priv->txq2sq[i];
 
 		if (!netif_xmit_stopped(dev_queue))
@@ -5334,10 +5362,14 @@  struct net_device *mlx5e_create_netdev(struct mlx5_core_dev *mdev,
 				       void *ppriv)
 {
 	struct net_device *netdev;
+	unsigned int ptp_txqs = 0;
 	int err;
 
+	if (MLX5_CAP_GEN(mdev, ts_cqe_to_dest_cqn))
+		ptp_txqs = profile->max_tc;
+
 	netdev = alloc_etherdev_mqs(sizeof(struct mlx5e_priv),
-				    nch * profile->max_tc,
+				    nch * profile->max_tc + ptp_txqs,
 				    nch * profile->rq_groups);
 	if (!netdev) {
 		mlx5_core_err(mdev, "alloc_etherdev_mqs() failed\n");
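
The txq numbering that mlx5e_build_txq_maps(), mlx5e_update_netdev_queues() and mlx5e_create_netdev() agree on in the hunks above can be sketched in plain C: the regular SQs occupy txq indices [0, num_tc * num_channels), and the PTP SQs follow immediately after, one per TC. The helper names below are hypothetical and exist only for this illustration; the arithmetic is taken from the hunks.

```c
/* Regular txq index for (channel, tc), as stored in channel_tc2realtxq[ch][tc]. */
static int regular_txq(int ch, int tc, int num_channels)
{
	return ch + tc * num_channels;
}

/* PTP txq index for tc, as stored in port_ptp_tc2realtxq[tc]. */
static int ptp_txq(int tc, int num_tc, int num_channels)
{
	return num_tc * num_channels + tc;
}

/* Number of real TX queues reported to the stack. */
static int num_txqs(int num_tc, int num_channels, int ptp_enabled)
{
	int n = num_tc * num_channels;

	return ptp_enabled ? n + num_tc : n;
}
```

With 4 channels and 2 TCs, for example, the regular queues are 0..7 and the PTP queues are 8 and 9, which matches ix_base + tc in mlx5e_ptp_open_txqsqs().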
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c b/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c
index ebfb47a09128..9d57dc94c767 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c
@@ -402,6 +402,24 @@  static void mlx5e_stats_grp_sw_update_stats_sq(struct mlx5e_sw_stats *s,
 	s->tx_cqes                  += sq_stats->cqes;
 }
 
+static void mlx5e_stats_grp_sw_update_stats_ptp(struct mlx5e_priv *priv,
+						struct mlx5e_sw_stats *s)
+{
+	int i;
+
+	if (!priv->port_ptp_opened)
+		return;
+
+	mlx5e_stats_grp_sw_update_stats_ch_stats(s, &priv->port_ptp_stats.ch);
+
+	for (i = 0; i < priv->max_opened_tc; i++) {
+		mlx5e_stats_grp_sw_update_stats_sq(s, &priv->port_ptp_stats.sq[i]);
+
+		/* https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92657 */
+		barrier();
+	}
+}
+
 static MLX5E_DECLARE_STATS_GRP_OP_UPDATE_STATS(sw)
 {
 	struct mlx5e_sw_stats *s = &priv->stats.sw;
@@ -430,6 +448,7 @@  static MLX5E_DECLARE_STATS_GRP_OP_UPDATE_STATS(sw)
 			barrier();
 		}
 	}
+	mlx5e_stats_grp_sw_update_stats_ptp(priv, s);
 }
 
 static const struct counter_desc q_stats_desc[] = {
@@ -1690,6 +1709,30 @@  static const struct counter_desc ch_stats_desc[] = {
 	{ MLX5E_DECLARE_CH_STAT(struct mlx5e_ch_stats, eq_rearm) },
 };
 
+static const struct counter_desc ptp_sq_stats_desc[] = {
+	{ MLX5E_DECLARE_PTP_TX_STAT(struct mlx5e_sq_stats, packets) },
+	{ MLX5E_DECLARE_PTP_TX_STAT(struct mlx5e_sq_stats, bytes) },
+	{ MLX5E_DECLARE_PTP_TX_STAT(struct mlx5e_sq_stats, csum_partial) },
+	{ MLX5E_DECLARE_PTP_TX_STAT(struct mlx5e_sq_stats, csum_partial_inner) },
+	{ MLX5E_DECLARE_PTP_TX_STAT(struct mlx5e_sq_stats, added_vlan_packets) },
+	{ MLX5E_DECLARE_PTP_TX_STAT(struct mlx5e_sq_stats, nop) },
+	{ MLX5E_DECLARE_PTP_TX_STAT(struct mlx5e_sq_stats, csum_none) },
+	{ MLX5E_DECLARE_PTP_TX_STAT(struct mlx5e_sq_stats, stopped) },
+	{ MLX5E_DECLARE_PTP_TX_STAT(struct mlx5e_sq_stats, dropped) },
+	{ MLX5E_DECLARE_PTP_TX_STAT(struct mlx5e_sq_stats, xmit_more) },
+	{ MLX5E_DECLARE_PTP_TX_STAT(struct mlx5e_sq_stats, recover) },
+	{ MLX5E_DECLARE_PTP_TX_STAT(struct mlx5e_sq_stats, cqes) },
+	{ MLX5E_DECLARE_PTP_TX_STAT(struct mlx5e_sq_stats, wake) },
+	{ MLX5E_DECLARE_PTP_TX_STAT(struct mlx5e_sq_stats, cqe_err) },
+};
+
+static const struct counter_desc ptp_ch_stats_desc[] = {
+	{ MLX5E_DECLARE_PTP_CH_STAT(struct mlx5e_ch_stats, events) },
+	{ MLX5E_DECLARE_PTP_CH_STAT(struct mlx5e_ch_stats, poll) },
+	{ MLX5E_DECLARE_PTP_CH_STAT(struct mlx5e_ch_stats, arm) },
+	{ MLX5E_DECLARE_PTP_CH_STAT(struct mlx5e_ch_stats, eq_rearm) },
+};
+
 #define NUM_RQ_STATS			ARRAY_SIZE(rq_stats_desc)
 #define NUM_SQ_STATS			ARRAY_SIZE(sq_stats_desc)
 #define NUM_XDPSQ_STATS			ARRAY_SIZE(xdpsq_stats_desc)
@@ -1697,6 +1740,57 @@  static const struct counter_desc ch_stats_desc[] = {
 #define NUM_XSKRQ_STATS			ARRAY_SIZE(xskrq_stats_desc)
 #define NUM_XSKSQ_STATS			ARRAY_SIZE(xsksq_stats_desc)
 #define NUM_CH_STATS			ARRAY_SIZE(ch_stats_desc)
+#define NUM_PTP_SQ_STATS		ARRAY_SIZE(ptp_sq_stats_desc)
+#define NUM_PTP_CH_STATS		ARRAY_SIZE(ptp_ch_stats_desc)
+
+static MLX5E_DECLARE_STATS_GRP_OP_NUM_STATS(ptp)
+{
+	return priv->port_ptp_opened ?
+	       NUM_PTP_CH_STATS + (NUM_PTP_SQ_STATS * priv->max_opened_tc) :
+	       0;
+}
+
+static MLX5E_DECLARE_STATS_GRP_OP_FILL_STRS(ptp)
+{
+	int i, tc;
+
+	if (!priv->port_ptp_opened)
+		return idx;
+
+	for (i = 0; i < NUM_PTP_CH_STATS; i++)
+		sprintf(data + (idx++) * ETH_GSTRING_LEN,
+			ptp_ch_stats_desc[i].format);
+
+	for (tc = 0; tc < priv->max_opened_tc; tc++)
+		for (i = 0; i < NUM_PTP_SQ_STATS; i++)
+			sprintf(data + (idx++) * ETH_GSTRING_LEN,
+				ptp_sq_stats_desc[i].format, tc);
+
+	return idx;
+}
+
+static MLX5E_DECLARE_STATS_GRP_OP_FILL_STATS(ptp)
+{
+	int i, tc;
+
+	if (!priv->port_ptp_opened)
+		return idx;
+
+	for (i = 0; i < NUM_PTP_CH_STATS; i++)
+		data[idx++] =
+			MLX5E_READ_CTR64_CPU(&priv->port_ptp_stats.ch,
+					     ptp_ch_stats_desc, i);
+
+	for (tc = 0; tc < priv->max_opened_tc; tc++)
+		for (i = 0; i < NUM_PTP_SQ_STATS; i++)
+			data[idx++] =
+				MLX5E_READ_CTR64_CPU(&priv->port_ptp_stats.sq[tc],
+						     ptp_sq_stats_desc, i);
+
+	return idx;
+}
+
+static MLX5E_DECLARE_STATS_GRP_OP_UPDATE_STATS(ptp) { return; }
 
 static MLX5E_DECLARE_STATS_GRP_OP_NUM_STATS(channels)
 {
@@ -1818,6 +1912,7 @@  MLX5E_DEFINE_STATS_GRP(channels, 0);
 MLX5E_DEFINE_STATS_GRP(per_port_buff_congest, 0);
 MLX5E_DEFINE_STATS_GRP(eth_ext, 0);
 static MLX5E_DEFINE_STATS_GRP(tls, 0);
+static MLX5E_DEFINE_STATS_GRP(ptp, 0);
 
 /* The stats groups order is opposite to the update_stats() order calls */
 mlx5e_stats_grp_t mlx5e_nic_stats_grps[] = {
@@ -1840,6 +1935,7 @@  mlx5e_stats_grp_t mlx5e_nic_stats_grps[] = {
 	&MLX5E_STATS_GRP(tls),
 	&MLX5E_STATS_GRP(channels),
 	&MLX5E_STATS_GRP(per_port_buff_congest),
+	&MLX5E_STATS_GRP(ptp),
 };
 
 unsigned int mlx5e_nic_stats_grps_num(struct mlx5e_priv *priv)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_stats.h b/drivers/net/ethernet/mellanox/mlx5/core/en_stats.h
index 162daaadb0d8..98ffebcc93b9 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_stats.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_stats.h
@@ -51,6 +51,9 @@ 
 #define MLX5E_DECLARE_XSKSQ_STAT(type, fld) "tx%d_xsk_"#fld, offsetof(type, fld)
 #define MLX5E_DECLARE_CH_STAT(type, fld) "ch%d_"#fld, offsetof(type, fld)
 
+#define MLX5E_DECLARE_PTP_TX_STAT(type, fld) "ptp_tx%d_"#fld, offsetof(type, fld)
+#define MLX5E_DECLARE_PTP_CH_STAT(type, fld) "ptp_ch_"#fld, offsetof(type, fld)
+
 struct counter_desc {
 	char		format[ETH_GSTRING_LEN];
 	size_t		offset; /* Byte offset */
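
How the new ptp stats group lays out its counters can be sketched in user space: the channel counters appear once, with no index in the name, and the SQ counters repeat once per TC with the TC number substituted for %d. This is a rough sketch, assuming shortened descriptor tables and hypothetical function names (ptp_num_stats, ptp_fill_strings); only the name formats and the counting rule come from the patch.

```c
#include <stddef.h>
#include <stdio.h>

#define GSTRING_LEN 32	/* value of ETH_GSTRING_LEN */

/* Shortened, illustrative copies of the descriptor name formats. */
static const char *const ptp_ch_fmt[] = { "ptp_ch_events", "ptp_ch_poll" };
static const char *const ptp_sq_fmt[] = { "ptp_tx%d_packets", "ptp_tx%d_bytes" };

#define N_PTP_CH (sizeof(ptp_ch_fmt) / sizeof(ptp_ch_fmt[0]))
#define N_PTP_SQ (sizeof(ptp_sq_fmt) / sizeof(ptp_sq_fmt[0]))

/* Number of PTP counters: channel counters once, SQ counters per TC. */
static int ptp_num_stats(int opened, int max_tc)
{
	return opened ? (int)(N_PTP_CH + N_PTP_SQ * (size_t)max_tc) : 0;
}

/* Write one name per GSTRING_LEN slot starting at slot idx; return the next slot. */
static int ptp_fill_strings(char *data, int idx, int opened, int max_tc)
{
	size_t i;
	int tc;

	if (!opened)
		return idx;

	for (i = 0; i < N_PTP_CH; i++)
		snprintf(data + (idx++) * GSTRING_LEN, GSTRING_LEN, "%s", ptp_ch_fmt[i]);

	for (tc = 0; tc < max_tc; tc++)
		for (i = 0; i < N_PTP_SQ; i++)
			snprintf(data + (idx++) * GSTRING_LEN, GSTRING_LEN, ptp_sq_fmt[i], tc);

	return idx;
}
```

The fill order has to match the order used when filling the values, since ethtool pairs names and values by position.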
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
index c6b20b77a0f2..0ae68cb25035 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
@@ -39,6 +39,7 @@ 
 #include "ipoib/ipoib.h"
 #include "en_accel/en_accel.h"
 #include "lib/clock.h"
+#include "en/ptp.h"
 
 static void mlx5e_dma_unmap_wqe_err(struct mlx5e_txqsq *sq, u8 num_dma)
 {
@@ -66,14 +67,67 @@  static inline int mlx5e_get_dscp_up(struct mlx5e_priv *priv, struct sk_buff *skb
 }
 #endif
 
+static bool mlx5e_use_ptpsq(struct sk_buff *skb)
+{
+	struct flow_keys fk;
+
+	if (!skb_flow_dissect_flow_keys(skb, &fk, 0))
+		return false;
+
+	if (fk.basic.n_proto == htons(ETH_P_1588))
+		return true;
+
+	if (fk.basic.n_proto != htons(ETH_P_IP) &&
+	    fk.basic.n_proto != htons(ETH_P_IPV6))
+		return false;
+
+	return fk.basic.ip_proto == IPPROTO_UDP;
+}
+
+static u16 mlx5e_select_ptpsq(struct net_device *dev, struct sk_buff *skb)
+{
+	struct mlx5e_priv *priv = netdev_priv(dev);
+	int up = 0;
+
+	if (!netdev_get_num_tc(dev))
+		goto return_txq;
+
+#ifdef CONFIG_MLX5_CORE_EN_DCB
+	if (priv->dcbx_dp.trust_state == MLX5_QPTS_TRUST_DSCP)
+		up = mlx5e_get_dscp_up(priv, skb);
+	else
+#endif
+		if (skb_vlan_tag_present(skb))
+			up = skb_vlan_tag_get_prio(skb);
+
+return_txq:
+	return priv->port_ptp_tc2realtxq[up];
+}
+
 u16 mlx5e_select_queue(struct net_device *dev, struct sk_buff *skb,
 		       struct net_device *sb_dev)
 {
-	int txq_ix = netdev_pick_tx(dev, skb, NULL);
 	struct mlx5e_priv *priv = netdev_priv(dev);
+	int txq_ix;
 	int up = 0;
 	int ch_ix;
 
+	if (unlikely(priv->channels.port_ptp)) {
+		if (unlikely(skb_shinfo(skb)->tx_flags & SKBTX_HW_TSTAMP) &&
+		    mlx5e_use_ptpsq(skb))
+			return mlx5e_select_ptpsq(dev, skb);
+
+		txq_ix = netdev_pick_tx(dev, skb, NULL);
+		/* netdev_pick_tx() may return a ptp_channel txq.
+		 * If it does, fold the index back onto a regular queue;
+		 * ptp txqs are selected only in mlx5e_select_ptpsq().
+		 */
+		if (unlikely(txq_ix >= priv->num_tc_x_num_ch))
+			txq_ix = txq_ix % priv->num_tc_x_num_ch;
+	} else {
+		txq_ix = netdev_pick_tx(dev, skb, NULL);
+	}
+
 	if (!netdev_get_num_tc(dev))
 		return txq_ix;