diff mbox series

Use ixgbe_ptp_reset on linkup/linkdown for X550

Message ID 20220801133750.7312-1-achaiken@aurora.tech
State Superseded
Headers show
Series Use ixgbe_ptp_reset on linkup/linkdown for X550 | expand

Commit Message

Alison Chaiken Aug. 1, 2022, 1:37 p.m. UTC
From: Steve Payne <spayne@aurora.tech>

For an unknown reason, when `ixgbe_ptp_start_cyclecounter` is called
from `ixgbe_watchdog_link_is_down` the PHC on the NIC jumps backward
by a seemingly inconsistent amount, which causes discontinuities in
time synchronization. Explicitly reset the NIC's PHC to
`CLOCK_REALTIME` whenever the NIC goes up or down by calling
`ixgbe_ptp_reset` instead of the bare `ixgbe_ptp_start_cyclecounter`.

Signed-off-by: Steve Payne <spayne@aurora.tech>
Signed-off-by: Alison Chaiken <achaiken@aurora.tech>

---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Comments

Jacob Keller Aug. 1, 2022, 9:34 p.m. UTC | #1
> -----Original Message-----
> From: achaiken@aurora.tech <achaiken@aurora.tech>
> Sent: Monday, August 01, 2022 6:38 AM
> To: Brandeburg, Jesse <jesse.brandeburg@intel.com>;
> richardcochran@gmail.com
> Cc: spayne@aurora.tech; achaiken@aurora.tech; alison@she-devel.com;
> netdev@vger.kernel.org; intel-wired-lan@lists.osuosl.org
> Subject: [PATCH] Use ixgbe_ptp_reset on linkup/linkdown for X550
> 
> From: Steve Payne <spayne@aurora.tech>
> 
> For an unknown reason, when `ixgbe_ptp_start_cyclecounter` is called
> from `ixgbe_watchdog_link_is_down` the PHC on the NIC jumps backward
> by a seemingly inconsistent amount, which causes discontinuities in
> time synchronization. Explicitly reset the NIC's PHC to
> `CLOCK_REALTIME` whenever the NIC goes up or down by calling
> `ixgbe_ptp_reset` instead of the bare `ixgbe_ptp_start_cyclecounter`.
> 
> Signed-off-by: Steve Payne <spayne@aurora.tech>
> Signed-off-by: Alison Chaiken <achaiken@aurora.tech>
> 

Resetting PTP could be a problem if the clock was not being synchronized with the kernel CLOCK_REALTIME, and does result in some loss of timer precision either way due to the delays involved with setting the time.

Do you have an example of the clock jump? How much is it? How often is it? Every time? More information would help in order to debug what is going wrong here.

Thanks,
Jake

> ---
>  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> index 750b02bb2fdc2..ab1ec076fa75f 100644
> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> @@ -7462,7 +7462,7 @@ static void ixgbe_watchdog_link_is_up(struct
> ixgbe_adapter *adapter)
>  	adapter->last_rx_ptp_check = jiffies;
> 
>  	if (test_bit(__IXGBE_PTP_RUNNING, &adapter->state))
> -		ixgbe_ptp_start_cyclecounter(adapter);
> +		ixgbe_ptp_reset(adapter);
> 
>  	switch (link_speed) {
>  	case IXGBE_LINK_SPEED_10GB_FULL:
> @@ -7527,7 +7527,7 @@ static void ixgbe_watchdog_link_is_down(struct
> ixgbe_adapter *adapter)
>  		adapter->flags2 |= IXGBE_FLAG2_SEARCH_FOR_SFP;
> 
>  	if (test_bit(__IXGBE_PTP_RUNNING, &adapter->state))
> -		ixgbe_ptp_start_cyclecounter(adapter);
> +		ixgbe_ptp_reset(adapter);
> 
>  	e_info(drv, "NIC Link is Down\n");
>  	netif_carrier_off(netdev);
> --
> 2.32.0
Ilya Evenbach Aug. 1, 2022, 10:53 p.m. UTC | #2
On Mon, Aug 1, 2022 at 3:16 PM Alison Chaiken <achaiken@aurora.tech> wrote:

>
>
> ---------- Forwarded message ---------
> From: Keller, Jacob E <jacob.e.keller@intel.com>
> Date: Mon, Aug 1, 2022 at 2:34 PM
> Subject: RE: [PATCH] Use ixgbe_ptp_reset on linkup/linkdown for X550
> To: achaiken@aurora.tech <achaiken@aurora.tech>, Brandeburg, Jesse <
> jesse.brandeburg@intel.com>, richardcochran@gmail.com <
> richardcochran@gmail.com>
> Cc: spayne@aurora.tech <spayne@aurora.tech>, alison@she-devel.com <
> alison@she-devel.com>, netdev@vger.kernel.org <netdev@vger.kernel.org>,
> intel-wired-lan@lists.osuosl.org <intel-wired-lan@lists.osuosl.org>
> > -----Original Message-----
> > From: achaiken@aurora.tech <achaiken@aurora.tech>
> > Sent: Monday, August 01, 2022 6:38 AM
> > To: Brandeburg, Jesse <jesse.brandeburg@intel.com>;
> > richardcochran@gmail.com
> > Cc: spayne@aurora.tech; achaiken@aurora.tech; alison@she-devel.com;
> > netdev@vger.kernel.org; intel-wired-lan@lists.osuosl.org
> > Subject: [PATCH] Use ixgbe_ptp_reset on linkup/linkdown for X550
> >
> > From: Steve Payne <spayne@aurora.tech>
> >
> > For an unknown reason, when `ixgbe_ptp_start_cyclecounter` is called
> > from `ixgbe_watchdog_link_is_down` the PHC on the NIC jumps backward
> > by a seemingly inconsistent amount, which causes discontinuities in
> > time synchronization. Explicitly reset the NIC's PHC to
> > `CLOCK_REALTIME` whenever the NIC goes up or down by calling
> > `ixgbe_ptp_reset` instead of the bare `ixgbe_ptp_start_cyclecounter`.
> >
> > Signed-off-by: Steve Payne <spayne@aurora.tech>
> > Signed-off-by: Alison Chaiken <achaiken@aurora.tech>
> >
>
> Resetting PTP could be a problem if the clock was not being synchronized
> with the kernel CLOCK_REALTIME,
>
That is true, but most likely not really important, as the unmitigated
problem also introduces significant discontinuities in time.
Basically, this patch does not make things worse.


> and does result in some loss of timer precision either way due to the
> delays involved with setting the time.
>
 That precision loss is negligible compared to jumps resulting from link
down/up, and should be corrected by normal PTP operation very quickly.


> Do you have an example of the clock jump? How much is it?
>
2021-02-12T09:24:37.741191+00:00 bench-12 phc2sys: [195230.451]
CLOCK_REALTIME phc offset        61 s2 freq  -36503 delay   2298
2021-02-12T09:24:38.741315+00:00 bench-12 phc2sys: [195231.451]
CLOCK_REALTIME phc offset       169 s2 freq  -36377 delay   2294
2021-02-12T09:24:39.741407+00:00 bench-12 phc2sys: [195232.451]
CLOCK_REALTIME phc offset 195213702387037 s2 freq +100000000 delay   2301
2021-02-12T09:24:40.741489+00:00 bench-12 phc2sys: [195233.452]
CLOCK_REALTIME phc offset 195213591220495 s2 freq +100000000 delay   2081


> How often is it? Every time?
>
Every time (though the specific amount differs, it is usually at similar
magnitude)

More information would help in order to debug what is going wrong here.
>
> Thanks,
> Jake
>
> > ---
> >  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> > b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> > index 750b02bb2fdc2..ab1ec076fa75f 100644
> > --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> > +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> > @@ -7462,7 +7462,7 @@ static void ixgbe_watchdog_link_is_up(struct
> > ixgbe_adapter *adapter)
> >       adapter->last_rx_ptp_check = jiffies;
> >
> >       if (test_bit(__IXGBE_PTP_RUNNING, &adapter->state))
> > -             ixgbe_ptp_start_cyclecounter(adapter);
> > +             ixgbe_ptp_reset(adapter);
> >
> >       switch (link_speed) {
> >       case IXGBE_LINK_SPEED_10GB_FULL:
> > @@ -7527,7 +7527,7 @@ static void ixgbe_watchdog_link_is_down(struct
> > ixgbe_adapter *adapter)
> >               adapter->flags2 |= IXGBE_FLAG2_SEARCH_FOR_SFP;
> >
> >       if (test_bit(__IXGBE_PTP_RUNNING, &adapter->state))
> > -             ixgbe_ptp_start_cyclecounter(adapter);
> > +             ixgbe_ptp_reset(adapter);
> >
> >       e_info(drv, "NIC Link is Down\n");
> >       netif_carrier_off(netdev);
> > --
> > 2.32.0
>
>
Ilya Evenbach Aug. 1, 2022, 11 p.m. UTC | #3
> > -----Original Message-----
> > From: achaiken@aurora.tech <achaiken@aurora.tech>
> > Sent: Monday, August 01, 2022 6:38 AM
> > To: Brandeburg, Jesse <jesse.brandeburg@intel.com>;
> > richardcochran@gmail.com
> > Cc: spayne@aurora.tech; achaiken@aurora.tech; alison@she-devel.com;
> > netdev@vger.kernel.org; intel-wired-lan@lists.osuosl.org
> > Subject: [PATCH] Use ixgbe_ptp_reset on linkup/linkdown for X550
> >
> > From: Steve Payne <spayne@aurora.tech>
> >
> > For an unknown reason, when `ixgbe_ptp_start_cyclecounter` is called
> > from `ixgbe_watchdog_link_is_down` the PHC on the NIC jumps backward
> > by a seemingly inconsistent amount, which causes discontinuities in
> > time synchronization. Explicitly reset the NIC's PHC to
> > `CLOCK_REALTIME` whenever the NIC goes up or down by calling
> > `ixgbe_ptp_reset` instead of the bare `ixgbe_ptp_start_cyclecounter`.
> >
> > Signed-off-by: Steve Payne <spayne@aurora.tech>
> > Signed-off-by: Alison Chaiken <achaiken@aurora.tech>
> >
>
> Resetting PTP could be a problem if the clock was not being synchronized with the kernel CLOCK_REALTIME,

That is true, but most likely not really important, as the unmitigated
problem also introduces significant discontinuities in time.
Basically, this patch does not make things worse.

>
> and does result in some loss of timer precision either way due to the delays involved with setting the time.

 That precision loss is negligible compared to jumps resulting from
link down/up, and should be corrected by normal PTP operation very
quickly.

>
> Do you have an example of the clock jump? How much is it?

2021-02-12T09:24:37.741191+00:00 bench-12 phc2sys: [195230.451]
CLOCK_REALTIME phc offset        61 s2 freq  -36503 delay   2298
2021-02-12T09:24:38.741315+00:00 bench-12 phc2sys: [195231.451]
CLOCK_REALTIME phc offset       169 s2 freq  -36377 delay   2294
2021-02-12T09:24:39.741407+00:00 bench-12 phc2sys: [195232.451]
CLOCK_REALTIME phc offset 195213702387037 s2 freq +100000000 delay
2301
2021-02-12T09:24:40.741489+00:00 bench-12 phc2sys: [195233.452]
CLOCK_REALTIME phc offset 195213591220495 s2 freq +100000000 delay
2081

>
> How often is it? Every time?

Every time (though the specific amount differs, it is usually at
similar magnitude)

> More information would help in order to debug what is going wrong here.
>
> Thanks,
> Jake
>
> > ---
> >  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> > b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> > index 750b02bb2fdc2..ab1ec076fa75f 100644
> > --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> > +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> > @@ -7462,7 +7462,7 @@ static void ixgbe_watchdog_link_is_up(struct
> > ixgbe_adapter *adapter)
> >       adapter->last_rx_ptp_check = jiffies;
> >
> >       if (test_bit(__IXGBE_PTP_RUNNING, &adapter->state))
> > -             ixgbe_ptp_start_cyclecounter(adapter);
> > +             ixgbe_ptp_reset(adapter);
> >
> >       switch (link_speed) {
> >       case IXGBE_LINK_SPEED_10GB_FULL:
> > @@ -7527,7 +7527,7 @@ static void ixgbe_watchdog_link_is_down(struct
> > ixgbe_adapter *adapter)
> >               adapter->flags2 |= IXGBE_FLAG2_SEARCH_FOR_SFP;
> >
> >       if (test_bit(__IXGBE_PTP_RUNNING, &adapter->state))
> > -             ixgbe_ptp_start_cyclecounter(adapter);
> > +             ixgbe_ptp_reset(adapter);
> >
> >       e_info(drv, "NIC Link is Down\n");
> >       netif_carrier_off(netdev);
> > --
> > 2.32.0
>
Jacob Keller Aug. 1, 2022, 11:29 p.m. UTC | #4
On 8/1/2022 4:00 PM, Ilya Evenbach wrote:
>>> -----Original Message-----
>>> From: achaiken@aurora.tech <achaiken@aurora.tech>
>>> Sent: Monday, August 01, 2022 6:38 AM
>>> To: Brandeburg, Jesse <jesse.brandeburg@intel.com>;
>>> richardcochran@gmail.com
>>> Cc: spayne@aurora.tech; achaiken@aurora.tech; alison@she-devel.com;
>>> netdev@vger.kernel.org; intel-wired-lan@lists.osuosl.org
>>> Subject: [PATCH] Use ixgbe_ptp_reset on linkup/linkdown for X550
>>>
>>> From: Steve Payne <spayne@aurora.tech>
>>>
>>> For an unknown reason, when `ixgbe_ptp_start_cyclecounter` is called
>>> from `ixgbe_watchdog_link_is_down` the PHC on the NIC jumps backward
>>> by a seemingly inconsistent amount, which causes discontinuities in
>>> time synchronization. Explicitly reset the NIC's PHC to
>>> `CLOCK_REALTIME` whenever the NIC goes up or down by calling
>>> `ixgbe_ptp_reset` instead of the bare `ixgbe_ptp_start_cyclecounter`.
>>>
>>> Signed-off-by: Steve Payne <spayne@aurora.tech>
>>> Signed-off-by: Alison Chaiken <achaiken@aurora.tech>
>>>
>>
>> Resetting PTP could be a problem if the clock was not being synchronized with the kernel CLOCK_REALTIME,
> 
> That is true, but most likely not really important, as the unmitigated
> problem also introduces significant discontinuities in time.
> Basically, this patch does not make things worse.
> 

Sure, but I am trying to see if I can understand *why* things get wonky.
I suspect the issue is caused because of how we're resetting the
cyclecounter.

>>
>> and does result in some loss of timer precision either way due to the delays involved with setting the time.
> 
>  That precision loss is negligible compared to jumps resulting from
> link down/up, and should be corrected by normal PTP operation very
> quickly.
> 

Only if CLOCK_REALTIME is actually being synchronized. Yes, that is
generally true, but its not necessarily guaranteed.

>>
>> Do you have an example of the clock jump? How much is it?
> 
> 2021-02-12T09:24:37.741191+00:00 bench-12 phc2sys: [195230.451]
> CLOCK_REALTIME phc offset        61 s2 freq  -36503 delay   2298
> 2021-02-12T09:24:38.741315+00:00 bench-12 phc2sys: [195231.451]
> CLOCK_REALTIME phc offset       169 s2 freq  -36377 delay   2294
> 2021-02-12T09:24:39.741407+00:00 bench-12 phc2sys: [195232.451]
> CLOCK_REALTIME phc offset 195213702387037 s2 freq +100000000 delay
> 2301
> 2021-02-12T09:24:40.741489+00:00 bench-12 phc2sys: [195233.452]
> CLOCK_REALTIME phc offset 195213591220495 s2 freq +100000000 delay
> 2081
> 

Thanks.

I think what's actually going on is a bug in the
ixgbe_ptp_start_cyclecounter function where the system time registers
are being reset.

What hardware are you operating on? Do you know if its an X550 board? It
looks like this has been the case since a9763f3cb54c ("ixgbe: Update PTP
to support X550EM_x devices").

The start_cyclecounter was never supposed to modify the current time
registers, but resetting it to 0 as it does for X550 devices would give
the exact behavior you're seeing.
Alison Chaiken Aug. 2, 2022, 12:24 a.m. UTC | #5
On Mon, Aug 1, 2022 at 4:29 PM Jacob Keller <jacob.e.keller@intel.com> wrote:
>
>
>
> On 8/1/2022 4:00 PM, Ilya Evenbach wrote:
> >>> -----Original Message-----
> >>> From: achaiken@aurora.tech <achaiken@aurora.tech>
> >>> Sent: Monday, August 01, 2022 6:38 AM
> >>> To: Brandeburg, Jesse <jesse.brandeburg@intel.com>;
> >>> richardcochran@gmail.com
> >>> Cc: spayne@aurora.tech; achaiken@aurora.tech; alison@she-devel.com;
> >>> netdev@vger.kernel.org; intel-wired-lan@lists.osuosl.org
> >>> Subject: [PATCH] Use ixgbe_ptp_reset on linkup/linkdown for X550
> >>>
> >>> From: Steve Payne <spayne@aurora.tech>
> >>>
> >>> For an unknown reason, when `ixgbe_ptp_start_cyclecounter` is called
> >>> from `ixgbe_watchdog_link_is_down` the PHC on the NIC jumps backward
> >>> by a seemingly inconsistent amount, which causes discontinuities in
> >>> time synchronization. Explicitly reset the NIC's PHC to
> >>> `CLOCK_REALTIME` whenever the NIC goes up or down by calling
> >>> `ixgbe_ptp_reset` instead of the bare `ixgbe_ptp_start_cyclecounter`.
> >>>
> >>> Signed-off-by: Steve Payne <spayne@aurora.tech>
> >>> Signed-off-by: Alison Chaiken <achaiken@aurora.tech>
> >>>
> >>
> >> Resetting PTP could be a problem if the clock was not being synchronized with the kernel CLOCK_REALTIME,
> >
> > That is true, but most likely not really important, as the unmitigated
> > problem also introduces significant discontinuities in time.
> > Basically, this patch does not make things worse.
> >
>
> Sure, but I am trying to see if I can understand *why* things get wonky.
> I suspect the issue is caused because of how we're resetting the
> cyclecounter.
>
> >>
> >> and does result in some loss of timer precision either way due to the delays involved with setting the time.
> >
> >  That precision loss is negligible compared to jumps resulting from
> > link down/up, and should be corrected by normal PTP operation very
> > quickly.
> >
>
> Only if CLOCK_REALTIME is actually being synchronized. Yes, that is
> generally true, but its not necessarily guaranteed.
>
> >>
> >> Do you have an example of the clock jump? How much is it?
> >
> > 2021-02-12T09:24:37.741191+00:00 bench-12 phc2sys: [195230.451]
> > CLOCK_REALTIME phc offset        61 s2 freq  -36503 delay   2298
> > 2021-02-12T09:24:38.741315+00:00 bench-12 phc2sys: [195231.451]
> > CLOCK_REALTIME phc offset       169 s2 freq  -36377 delay   2294
> > 2021-02-12T09:24:39.741407+00:00 bench-12 phc2sys: [195232.451]
> > CLOCK_REALTIME phc offset 195213702387037 s2 freq +100000000 delay
> > 2301
> > 2021-02-12T09:24:40.741489+00:00 bench-12 phc2sys: [195233.452]
> > CLOCK_REALTIME phc offset 195213591220495 s2 freq +100000000 delay
> > 2081
> >
>
> Thanks.
>
> I think what's actually going on is a bug in the
> ixgbe_ptp_start_cyclecounter function where the system time registers
> are being reset.
>
> What hardware are you operating on? Do you know if its an X550 board?

Indeed it is.

> It
> looks like this has been the case since a9763f3cb54c ("ixgbe: Update PTP
> to support X550EM_x devices").

The current test results come from v5.15.49-rt47. We observed the same
problem in 5.4.93-rt51, which contains a9763f3cb54c.

> The start_cyclecounter was never supposed to modify the current time
> registers, but resetting it to 0 as it does for X550 devices would give
> the exact behavior you're seeing.

That certainly sounds plausible.

Thanks,
Alison Chaiken
Aurora Innovation
Jacob Keller Aug. 2, 2022, 12:26 a.m. UTC | #6
On 8/1/2022 4:29 PM, Jacob Keller wrote:
> 
> 
> On 8/1/2022 4:00 PM, Ilya Evenbach wrote:
>>>> -----Original Message-----
>>>> From: achaiken@aurora.tech <achaiken@aurora.tech>
>>>> Sent: Monday, August 01, 2022 6:38 AM
>>>> To: Brandeburg, Jesse <jesse.brandeburg@intel.com>;
>>>> richardcochran@gmail.com
>>>> Cc: spayne@aurora.tech; achaiken@aurora.tech; alison@she-devel.com;
>>>> netdev@vger.kernel.org; intel-wired-lan@lists.osuosl.org
>>>> Subject: [PATCH] Use ixgbe_ptp_reset on linkup/linkdown for X550
>>>>
>>>> From: Steve Payne <spayne@aurora.tech>
>>>>
>>>> For an unknown reason, when `ixgbe_ptp_start_cyclecounter` is called
>>>> from `ixgbe_watchdog_link_is_down` the PHC on the NIC jumps backward
>>>> by a seemingly inconsistent amount, which causes discontinuities in
>>>> time synchronization. Explicitly reset the NIC's PHC to
>>>> `CLOCK_REALTIME` whenever the NIC goes up or down by calling
>>>> `ixgbe_ptp_reset` instead of the bare `ixgbe_ptp_start_cyclecounter`.
>>>>
>>>> Signed-off-by: Steve Payne <spayne@aurora.tech>
>>>> Signed-off-by: Alison Chaiken <achaiken@aurora.tech>
>>>>
>>>
>>> Resetting PTP could be a problem if the clock was not being synchronized with the kernel CLOCK_REALTIME,
>>
>> That is true, but most likely not really important, as the unmitigated
>> problem also introduces significant discontinuities in time.
>> Basically, this patch does not make things worse.
>>
> 
> Sure, but I am trying to see if I can understand *why* things get wonky.
> I suspect the issue is caused because of how we're resetting the
> cyclecounter.
> 
>>>
>>> and does result in some loss of timer precision either way due to the delays involved with setting the time.
>>
>>  That precision loss is negligible compared to jumps resulting from
>> link down/up, and should be corrected by normal PTP operation very
>> quickly.
>>
> 
> Only if CLOCK_REALTIME is actually being synchronized. Yes, that is
> generally true, but its not necessarily guaranteed.
> 
>>>
>>> Do you have an example of the clock jump? How much is it?
>>
>> 2021-02-12T09:24:37.741191+00:00 bench-12 phc2sys: [195230.451]
>> CLOCK_REALTIME phc offset        61 s2 freq  -36503 delay   2298
>> 2021-02-12T09:24:38.741315+00:00 bench-12 phc2sys: [195231.451]
>> CLOCK_REALTIME phc offset       169 s2 freq  -36377 delay   2294
>> 2021-02-12T09:24:39.741407+00:00 bench-12 phc2sys: [195232.451]
>> CLOCK_REALTIME phc offset 195213702387037 s2 freq +100000000 delay
>> 2301
>> 2021-02-12T09:24:40.741489+00:00 bench-12 phc2sys: [195233.452]
>> CLOCK_REALTIME phc offset 195213591220495 s2 freq +100000000 delay
>> 2081
>>
> 
> Thanks.
> 
> I think what's actually going on is a bug in the
> ixgbe_ptp_start_cyclecounter function where the system time registers
> are being reset.
> 
> What hardware are you operating on? Do you know if its an X550 board? It
> looks like this has been the case since a9763f3cb54c ("ixgbe: Update PTP
> to support X550EM_x devices").
> 
> The start_cyclecounter was never supposed to modify the current time
> registers, but resetting it to 0 as it does for X550 devices would give
> the exact behavior you're seeing.

I just posted an alternative fix which I believe resolves this issue.

Thanks,
Jake
Jacob Keller Aug. 2, 2022, 12:38 a.m. UTC | #7
On 8/1/2022 5:24 PM, Alison Chaiken wrote:
> On Mon, Aug 1, 2022 at 4:29 PM Jacob Keller <jacob.e.keller@intel.com> wrote:
>>
>>
>>
>> On 8/1/2022 4:00 PM, Ilya Evenbach wrote:
>>>>> -----Original Message-----
>>>>> From: achaiken@aurora.tech <achaiken@aurora.tech>
>>>>> Sent: Monday, August 01, 2022 6:38 AM
>>>>> To: Brandeburg, Jesse <jesse.brandeburg@intel.com>;
>>>>> richardcochran@gmail.com
>>>>> Cc: spayne@aurora.tech; achaiken@aurora.tech; alison@she-devel.com;
>>>>> netdev@vger.kernel.org; intel-wired-lan@lists.osuosl.org
>>>>> Subject: [PATCH] Use ixgbe_ptp_reset on linkup/linkdown for X550
>>>>>
>>>>> From: Steve Payne <spayne@aurora.tech>
>>>>>
>>>>> For an unknown reason, when `ixgbe_ptp_start_cyclecounter` is called
>>>>> from `ixgbe_watchdog_link_is_down` the PHC on the NIC jumps backward
>>>>> by a seemingly inconsistent amount, which causes discontinuities in
>>>>> time synchronization. Explicitly reset the NIC's PHC to
>>>>> `CLOCK_REALTIME` whenever the NIC goes up or down by calling
>>>>> `ixgbe_ptp_reset` instead of the bare `ixgbe_ptp_start_cyclecounter`.
>>>>>
>>>>> Signed-off-by: Steve Payne <spayne@aurora.tech>
>>>>> Signed-off-by: Alison Chaiken <achaiken@aurora.tech>
>>>>>
>>>>
>>>> Resetting PTP could be a problem if the clock was not being synchronized with the kernel CLOCK_REALTIME,
>>>
>>> That is true, but most likely not really important, as the unmitigated
>>> problem also introduces significant discontinuities in time.
>>> Basically, this patch does not make things worse.
>>>
>>
>> Sure, but I am trying to see if I can understand *why* things get wonky.
>> I suspect the issue is caused because of how we're resetting the
>> cyclecounter.
>>
>>>>
>>>> and does result in some loss of timer precision either way due to the delays involved with setting the time.
>>>
>>>  That precision loss is negligible compared to jumps resulting from
>>> link down/up, and should be corrected by normal PTP operation very
>>> quickly.
>>>
>>
>> Only if CLOCK_REALTIME is actually being synchronized. Yes, that is
>> generally true, but its not necessarily guaranteed.
>>
>>>>
>>>> Do you have an example of the clock jump? How much is it?
>>>
>>> 2021-02-12T09:24:37.741191+00:00 bench-12 phc2sys: [195230.451]
>>> CLOCK_REALTIME phc offset        61 s2 freq  -36503 delay   2298
>>> 2021-02-12T09:24:38.741315+00:00 bench-12 phc2sys: [195231.451]
>>> CLOCK_REALTIME phc offset       169 s2 freq  -36377 delay   2294
>>> 2021-02-12T09:24:39.741407+00:00 bench-12 phc2sys: [195232.451]
>>> CLOCK_REALTIME phc offset 195213702387037 s2 freq +100000000 delay
>>> 2301
>>> 2021-02-12T09:24:40.741489+00:00 bench-12 phc2sys: [195233.452]
>>> CLOCK_REALTIME phc offset 195213591220495 s2 freq +100000000 delay
>>> 2081
>>>
>>
>> Thanks.
>>
>> I think what's actually going on is a bug in the
>> ixgbe_ptp_start_cyclecounter function where the system time registers
>> are being reset.
>>
>> What hardware are you operating on? Do you know if its an X550 board?
> 
> Indeed it is.
> 
>> It
>> looks like this has been the case since a9763f3cb54c ("ixgbe: Update PTP
>> to support X550EM_x devices").
> 
> The current test results come from v5.15.49-rt47. We observed the same
> problem in 5.4.93-rt51, which contains a9763f3cb54c.
> 
>> The start_cyclecounter was never supposed to modify the current time
>> registers, but resetting it to 0 as it does for X550 devices would give
>> the exact behavior you're seeing.
> 
> That certainly sounds plausible.
> 
> Thanks,
> Alison Chaiken
> Aurora Innovation

I just posted a fix which moves the SYSTIME clearing out of
start_cyclecounter and into ixgbe_ptp_reset. I'm fairly confident that
its the correct fix, based on the function comments. I think the
implementor for the X550 simply didn't understand the separation of
ixgbe_ptp_start_cyclecouter and ixgbe_ptp_reset.

Thanks,
Jake
Alison Chaiken Oct. 4, 2022, 6:14 p.m. UTC | #8
How about this Intel X550 PTP fix for 6.1?

https://lists.osuosl.org/pipermail/intel-wired-lan/Week-of-Mon-20220801/029590.html

Thanks,
Alison Chaiken
Aurora Innovation

On Mon, Aug 1, 2022 at 5:26 PM Jacob Keller <jacob.e.keller@intel.com> wrote:
>
> On 8/1/2022 4:29 PM, Jacob Keller wrote:
> >
> >
> > On 8/1/2022 4:00 PM, Ilya Evenbach wrote:
> >>>> -----Original Message-----
> >>>> From: achaiken@aurora.tech <achaiken@aurora.tech>
> >>>> Sent: Monday, August 01, 2022 6:38 AM
> >>>> To: Brandeburg, Jesse <jesse.brandeburg@intel.com>;
> >>>> richardcochran@gmail.com
> >>>> Cc: spayne@aurora.tech; achaiken@aurora.tech; alison@she-devel.com;
> >>>> netdev@vger.kernel.org; intel-wired-lan@lists.osuosl.org
> >>>> Subject: [PATCH] Use ixgbe_ptp_reset on linkup/linkdown for X550
> >>>>
> >>>> From: Steve Payne <spayne@aurora.tech>
> >>>>
> >>>> For an unknown reason, when `ixgbe_ptp_start_cyclecounter` is called
> >>>> from `ixgbe_watchdog_link_is_down` the PHC on the NIC jumps backward
> >>>> by a seemingly inconsistent amount, which causes discontinuities in
> >>>> time synchronization. Explicitly reset the NIC's PHC to
> >>>> `CLOCK_REALTIME` whenever the NIC goes up or down by calling
> >>>> `ixgbe_ptp_reset` instead of the bare `ixgbe_ptp_start_cyclecounter`.
> >>>>
> >>>> Signed-off-by: Steve Payne <spayne@aurora.tech>
> >>>> Signed-off-by: Alison Chaiken <achaiken@aurora.tech>
> >>>>
> >>>
> >>> Resetting PTP could be a problem if the clock was not being synchronized with the kernel CLOCK_REALTIME,
> >>
> >> That is true, but most likely not really important, as the unmitigated
> >> problem also introduces significant discontinuities in time.
> >> Basically, this patch does not make things worse.
> >>
> >
> > Sure, but I am trying to see if I can understand *why* things get wonky.
> > I suspect the issue is caused because of how we're resetting the
> > cyclecounter.
> >
> >>>
> >>> and does result in some loss of timer precision either way due to the delays involved with setting the time.
> >>
> >>  That precision loss is negligible compared to jumps resulting from
> >> link down/up, and should be corrected by normal PTP operation very
> >> quickly.
> >>
> >
> > Only if CLOCK_REALTIME is actually being synchronized. Yes, that is
> > generally true, but its not necessarily guaranteed.
> >
> >>>
> >>> Do you have an example of the clock jump? How much is it?
> >>
> >> 2021-02-12T09:24:37.741191+00:00 bench-12 phc2sys: [195230.451]
> >> CLOCK_REALTIME phc offset        61 s2 freq  -36503 delay   2298
> >> 2021-02-12T09:24:38.741315+00:00 bench-12 phc2sys: [195231.451]
> >> CLOCK_REALTIME phc offset       169 s2 freq  -36377 delay   2294
> >> 2021-02-12T09:24:39.741407+00:00 bench-12 phc2sys: [195232.451]
> >> CLOCK_REALTIME phc offset 195213702387037 s2 freq +100000000 delay
> >> 2301
> >> 2021-02-12T09:24:40.741489+00:00 bench-12 phc2sys: [195233.452]
> >> CLOCK_REALTIME phc offset 195213591220495 s2 freq +100000000 delay
> >> 2081
> >>
> >
> > Thanks.
> >
> > I think what's actually going on is a bug in the
> > ixgbe_ptp_start_cyclecounter function where the system time registers
> > are being reset.
> >
> > What hardware are you operating on? Do you know if its an X550 board? It
> > looks like this has been the case since a9763f3cb54c ("ixgbe: Update PTP
> > to support X550EM_x devices").
> >
> > The start_cyclecounter was never supposed to modify the current time
> > registers, but resetting it to 0 as it does for X550 devices would give
> > the exact behavior you're seeing.
>
> I just posted an alternative fix which I believe resolves this issue.
>
> Thanks,
> Jake
Jacob Keller Oct. 4, 2022, 9:08 p.m. UTC | #9
On 10/4/2022 11:14 AM, Alison Chaiken wrote:
> How about this Intel X550 PTP fix for 6.1?
> 
> https://lists.osuosl.org/pipermail/intel-wired-lan/Week-of-Mon-20220801/029590.html
> 
> Thanks,
> Alison Chaiken
> Aurora Innovation
> 


This was already submitted to net as [1], and it looks like it's already
applied to the stable Linux 5.19, 5.15, and 5.10 stable trees.

I've confirmed that it's in the v6.0 tag as commit 25d7a5f5a6bb ("ixgbe:
stop resetting SYSTIME in ixgbe_ptp_start_cyclecounter")

[1]
https://lore.kernel.org/netdev/20220824193748.874343-2-anthony.l.nguyen@intel.com/

Hope that helps!

Thanks,
Jake
diff mbox series

Patch

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 750b02bb2fdc2..ab1ec076fa75f 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -7462,7 +7462,7 @@  static void ixgbe_watchdog_link_is_up(struct ixgbe_adapter *adapter)
 	adapter->last_rx_ptp_check = jiffies;
 
 	if (test_bit(__IXGBE_PTP_RUNNING, &adapter->state))
-		ixgbe_ptp_start_cyclecounter(adapter);
+		ixgbe_ptp_reset(adapter);
 
 	switch (link_speed) {
 	case IXGBE_LINK_SPEED_10GB_FULL:
@@ -7527,7 +7527,7 @@  static void ixgbe_watchdog_link_is_down(struct ixgbe_adapter *adapter)
 		adapter->flags2 |= IXGBE_FLAG2_SEARCH_FOR_SFP;
 
 	if (test_bit(__IXGBE_PTP_RUNNING, &adapter->state))
-		ixgbe_ptp_start_cyclecounter(adapter);
+		ixgbe_ptp_reset(adapter);
 
 	e_info(drv, "NIC Link is Down\n");
 	netif_carrier_off(netdev);