diff mbox

[U-Boot,RFC] lib/timer: initialize timebase_l/timebase_h

Message ID 1477356712-7015-1-git-send-email-andre.przywara@arm.com
State RFC
Delegated to: Tom Rini
Headers show

Commit Message

Andre Przywara Oct. 25, 2016, 12:51 a.m. UTC
On systems using the generic timer routines defined in lib/time.c we
use timebase_l and timebase_h fields from the gd to detect wraparounds
in our tick counter. The tick calculcation algorithm silently assumes
that a long is only 32 bits, which leads to wrong results when timebase_h
is not 0 on 64-bit systems.
As one possible fix lets initialize timebase_h (and timebase_l) to 0, so
on 64-bit systems timebase_h will never(TM) be bigger than 0 and thus
cannot spoil timebase_l by being ORed into it.

This fixes occasional timeout issues on the Pine64 (and possibly other
ARM64 systems).

Signed-off-by: Andre Przywara <andre.przywara@arm.com>
---
Hi,

I am bit puzzled what the proper fix is, this one is the easiest I could
come up with. Not sure if the gd should be zeroed normally (and it's just
broken on sunxi/arm64 because of some linker issues) or whether we really
forgot to initialize those fields and just got away with it.
Other fixes I tried are implementing get_ticks() in armv8/generic_timer.c
or fixing the generic get_ticks() implementation by using
"#if BITS_PER_LONG < 64" in the function to shortcut the code on 64-bit
systems.
Please have your say on what's best.

Cheers,
Andre.

P.S. "Never" above seems to be 195 years assuming a worst-case 3GHz
arch timer.

 lib/time.c | 2 ++
 1 file changed, 2 insertions(+)

Comments

Alexander Graf Oct. 25, 2016, 7:52 a.m. UTC | #1
On 25/10/2016 02:51, Andre Przywara wrote:
> On systems using the generic timer routines defined in lib/time.c we
> use timebase_l and timebase_h fields from the gd to detect wraparounds
> in our tick counter. The tick calculcation algorithm silently assumes
> that a long is only 32 bits, which leads to wrong results when timebase_h
> is not 0 on 64-bit systems.
> As one possible fix lets initialize timebase_h (and timebase_l) to 0, so
> on 64-bit systems timebase_h will never(TM) be bigger than 0 and thus
> cannot spoil timebase_l by being ORed into it.
> 
> This fixes occasional timeout issues on the Pine64 (and possibly other
> ARM64 systems).
> 
> Signed-off-by: Andre Przywara <andre.przywara@arm.com>
> ---
> Hi,
> 
> I am bit puzzled what the proper fix is, this one is the easiest I could
> come up with. Not sure if the gd should be zeroed normally (and it's just
> broken on sunxi/arm64 because of some linker issues) or whether we really
> forgot to initialize those fields and just got away with it.

The gd clearing happens via crt0_64.S -> board_init_f_init_reserve().
There we should have fully cleared all memory associated with global data.

I can't see anything obviously wrong in that code. Could you try to dump
gd if the timer offsets are != 0 on init? Maybe we can conclude
something from the data in it.


Alex
Andre Przywara Oct. 26, 2016, 12:07 a.m. UTC | #2
On 25/10/16 08:52, Alexander Graf wrote:

Hi Alex,

thanks for looking at this!

> 
> On 25/10/2016 02:51, Andre Przywara wrote:
>> On systems using the generic timer routines defined in lib/time.c we
>> use timebase_l and timebase_h fields from the gd to detect wraparounds
>> in our tick counter. The tick calculcation algorithm silently assumes
>> that a long is only 32 bits, which leads to wrong results when timebase_h
>> is not 0 on 64-bit systems.
>> As one possible fix lets initialize timebase_h (and timebase_l) to 0, so
>> on 64-bit systems timebase_h will never(TM) be bigger than 0 and thus
>> cannot spoil timebase_l by being ORed into it.
>>
>> This fixes occasional timeout issues on the Pine64 (and possibly other
>> ARM64 systems).

Well, not really (this fix isn't sufficient), but read on ...

>> Signed-off-by: Andre Przywara <andre.przywara@arm.com>
>> ---
>> Hi,
>>
>> I am bit puzzled what the proper fix is, this one is the easiest I could
>> come up with. Not sure if the gd should be zeroed normally (and it's just
>> broken on sunxi/arm64 because of some linker issues) or whether we really
>> forgot to initialize those fields and just got away with it.
> 
> The gd clearing happens via crt0_64.S -> board_init_f_init_reserve().
> There we should have fully cleared all memory associated with global data.

Ah,thanks for pointing to that. I was a bit clueless where to start
looking for it - "git grep gd" is obviously not a good idea ;-)

> I can't see anything obviously wrong in that code. Could you try to dump
> gd if the timer offsets are != 0 on init? Maybe we can conclude
> something from the data in it.

So I agree that this code looks sane and indeed in all my dumps it looks
like the initialization works fine.
I did some more debugging and learned that the increased timebase_h
comes from detected overflows: in fact some readings are really lower
than previous ones:
...
MMC:   SUNXI SD/MMC: 0
get_ticks() overflow: now: 118046720, tbl: 118063103, tbh: 0
get_ticks() overflow: now: 118439936, tbl: 118456318, tbh: 1
get_ticks() overflow: now: 118571008, tbl: 118587391, tbh: 2
get_ticks() overflow: now: 118734848, tbl: 118751230, tbh: 3
get_ticks() overflow: now: 119422976, tbl: 119439358, tbh: 4
get_ticks() overflow: now: 119783424, tbl: 119799807, tbh: 5
get_ticks() overflow: now: 120045568, tbl: 120061950, tbh: 6
get_ticks() overflow: now: 120406016, tbl: 120422398, tbh: 7
*** Warning - bad CRC, using default environment
......
Not sure how this actually happens - I am not aware of any such severe
hardware errata in the A53 r0p4 or the timer, also I think that would
have bitten us elsewhere already.
As ATF keeps the secondaries in WFI, U-Boot only runs on CPU0 (confirmed
by MPIDR reads).
Also U-Boot reads the physical counter, so a bogus CNTVOFF can also not
be the culprit.

So I can fix this annoying issue by using one of the other proposed
fixes (handling timebase_h only if BITS_PER_LONG < 64 or defining
get_ticks in armv8/generic_timer.c), but it would still be interesting
to find the real root cause.

Thanks,
Andre.
Alexander Graf Oct. 26, 2016, 7:14 a.m. UTC | #3
On 10/26/2016 02:07 AM, André Przywara wrote:
> On 25/10/16 08:52, Alexander Graf wrote:
>
> Hi Alex,
>
> thanks for looking at this!
>
>> On 25/10/2016 02:51, Andre Przywara wrote:
>>> On systems using the generic timer routines defined in lib/time.c we
>>> use timebase_l and timebase_h fields from the gd to detect wraparounds
>>> in our tick counter. The tick calculcation algorithm silently assumes
>>> that a long is only 32 bits, which leads to wrong results when timebase_h
>>> is not 0 on 64-bit systems.
>>> As one possible fix lets initialize timebase_h (and timebase_l) to 0, so
>>> on 64-bit systems timebase_h will never(TM) be bigger than 0 and thus
>>> cannot spoil timebase_l by being ORed into it.
>>>
>>> This fixes occasional timeout issues on the Pine64 (and possibly other
>>> ARM64 systems).
> Well, not really (this fix isn't sufficient), but read on ...
>
>>> Signed-off-by: Andre Przywara <andre.przywara@arm.com>
>>> ---
>>> Hi,
>>>
>>> I am bit puzzled what the proper fix is, this one is the easiest I could
>>> come up with. Not sure if the gd should be zeroed normally (and it's just
>>> broken on sunxi/arm64 because of some linker issues) or whether we really
>>> forgot to initialize those fields and just got away with it.
>> The gd clearing happens via crt0_64.S -> board_init_f_init_reserve().
>> There we should have fully cleared all memory associated with global data.
> Ah,thanks for pointing to that. I was a bit clueless where to start
> looking for it - "git grep gd" is obviously not a good idea ;-)
>
>> I can't see anything obviously wrong in that code. Could you try to dump
>> gd if the timer offsets are != 0 on init? Maybe we can conclude
>> something from the data in it.
> So I agree that this code looks sane and indeed in all my dumps it looks
> like the initialization works fine.
> I did some more debugging and learned that the increased timebase_h
> comes from detected overflows: in fact some readings are really lower
> than previous ones:
> ...
> MMC:   SUNXI SD/MMC: 0
> get_ticks() overflow: now: 118046720, tbl: 118063103, tbh: 0
> get_ticks() overflow: now: 118439936, tbl: 118456318, tbh: 1
> get_ticks() overflow: now: 118571008, tbl: 118587391, tbh: 2
> get_ticks() overflow: now: 118734848, tbl: 118751230, tbh: 3
> get_ticks() overflow: now: 119422976, tbl: 119439358, tbh: 4
> get_ticks() overflow: now: 119783424, tbl: 119799807, tbh: 5
> get_ticks() overflow: now: 120045568, tbl: 120061950, tbh: 6
> get_ticks() overflow: now: 120406016, tbl: 120422398, tbh: 7
> *** Warning - bad CRC, using default environment
> ......
> Not sure how this actually happens - I am not aware of any such severe
> hardware errata in the A53 r0p4 or the timer, also I think that would
> have bitten us elsewhere already.
> As ATF keeps the secondaries in WFI, U-Boot only runs on CPU0 (confirmed
> by MPIDR reads).
> Also U-Boot reads the physical counter, so a bogus CNTVOFF can also not
> be the culprit.
>
> So I can fix this annoying issue by using one of the other proposed
> fixes (handling timebase_h only if BITS_PER_LONG < 64 or defining
> get_ticks in armv8/generic_timer.c), but it would still be interesting
> to find the real root cause.

Can you try and ask around? If this bites us in U-Boot, there's a good 
chance Linux timers should be broken too, no?

I remember that NXP had similar problems with the timer:

   https://patchwork.kernel.org/patch/9344727/


Alex
Andre Przywara Nov. 4, 2016, 10:33 a.m. UTC | #4
Hi,

On 26/10/16 08:14, Alexander Graf wrote:
> On 10/26/2016 02:07 AM, André Przywara wrote:
>> On 25/10/16 08:52, Alexander Graf wrote:
>>
>> Hi Alex,
>>
>> thanks for looking at this!
>>
>>> On 25/10/2016 02:51, Andre Przywara wrote:
>>>> On systems using the generic timer routines defined in lib/time.c we
>>>> use timebase_l and timebase_h fields from the gd to detect wraparounds
>>>> in our tick counter. The tick calculcation algorithm silently assumes
>>>> that a long is only 32 bits, which leads to wrong results when
>>>> timebase_h
>>>> is not 0 on 64-bit systems.
>>>> As one possible fix lets initialize timebase_h (and timebase_l) to
>>>> 0, so
>>>> on 64-bit systems timebase_h will never(TM) be bigger than 0 and thus
>>>> cannot spoil timebase_l by being ORed into it.
>>>>
>>>> This fixes occasional timeout issues on the Pine64 (and possibly other
>>>> ARM64 systems).
>> Well, not really (this fix isn't sufficient), but read on ...
>>
>>>> Signed-off-by: Andre Przywara <andre.przywara@arm.com>
>>>> ---
>>>> Hi,
>>>>
>>>> I am bit puzzled what the proper fix is, this one is the easiest I
>>>> could
>>>> come up with. Not sure if the gd should be zeroed normally (and it's
>>>> just
>>>> broken on sunxi/arm64 because of some linker issues) or whether we
>>>> really
>>>> forgot to initialize those fields and just got away with it.
>>> The gd clearing happens via crt0_64.S -> board_init_f_init_reserve().
>>> There we should have fully cleared all memory associated with global
>>> data.
>> Ah,thanks for pointing to that. I was a bit clueless where to start
>> looking for it - "git grep gd" is obviously not a good idea ;-)
>>
>>> I can't see anything obviously wrong in that code. Could you try to dump
>>> gd if the timer offsets are != 0 on init? Maybe we can conclude
>>> something from the data in it.
>> So I agree that this code looks sane and indeed in all my dumps it looks
>> like the initialization works fine.
>> I did some more debugging and learned that the increased timebase_h
>> comes from detected overflows: in fact some readings are really lower
>> than previous ones:
>> ...
>> MMC:   SUNXI SD/MMC: 0
>> get_ticks() overflow: now: 118046720, tbl: 118063103, tbh: 0
>> get_ticks() overflow: now: 118439936, tbl: 118456318, tbh: 1
>> get_ticks() overflow: now: 118571008, tbl: 118587391, tbh: 2
>> get_ticks() overflow: now: 118734848, tbl: 118751230, tbh: 3
>> get_ticks() overflow: now: 119422976, tbl: 119439358, tbh: 4
>> get_ticks() overflow: now: 119783424, tbl: 119799807, tbh: 5
>> get_ticks() overflow: now: 120045568, tbl: 120061950, tbh: 6
>> get_ticks() overflow: now: 120406016, tbl: 120422398, tbh: 7
>> *** Warning - bad CRC, using default environment
>> ......
>> Not sure how this actually happens - I am not aware of any such severe
>> hardware errata in the A53 r0p4 or the timer, also I think that would
>> have bitten us elsewhere already.
>> As ATF keeps the secondaries in WFI, U-Boot only runs on CPU0 (confirmed
>> by MPIDR reads).
>> Also U-Boot reads the physical counter, so a bogus CNTVOFF can also not
>> be the culprit.
>>
>> So I can fix this annoying issue by using one of the other proposed
>> fixes (handling timebase_h only if BITS_PER_LONG < 64 or defining
>> get_ticks in armv8/generic_timer.c), but it would still be interesting
>> to find the real root cause.
> 
> Can you try and ask around? If this bites us in U-Boot, there's a good
> chance Linux timers should be broken too, no?
> 
> I remember that NXP had similar problems with the timer:
> 
>   https://patchwork.kernel.org/patch/9344727/

So I did some more experiments, and indeed it looks very much like a
silicon bug in the A64. Philipp found this in the BSP kernel:
https://github.com/longsleep/linux-pine64/blob/pine64-hacks-1.2/drivers/clocksource/arm_arch_timer.c#L231

So I have a direct test that reads both the CNTVCNT register and
CLOCK_MONOTONIC_RAW back to back for a few million times and looks for
decreasing values. On the Freescale box it triggers on every boot, on
A64 only on certain boots.

But the error pattern looks similar (though not identical), on the A64
it is:
CNTVCNT before: xxxx7fff, CNTVCNT after: xxxx4000
CNTVCNT before: xxxxffff, CNTVCNT after: xxxxc000
On the Freescale box I see different lengths of bogus lower bits, on the
A64 it's always the lower 15 bits.

These are at least the cases that I could spot (because the later read
returns a smaller value), there may be more corruptions that are harder
to find.

Setting the new Freescale erratum DT property in the arch timer node
fixes at least the Linux interface on the A64, as expected.
But as stated above this doesn't happen on every boot, so there might be
the chance that it's a missing initialization somewhere.

Since clock_gettime() gets disabled in the VDSO with the workaround, I'd
like to do more research to get an idea why it sometimes works and
sometimes not.

But unless we find another fix, I'll send a patch to enable the FSL
workaround (both in U-Boot and Linux).

Cheers,
Andre.
diff mbox

Patch

diff --git a/lib/time.c b/lib/time.c
index f37150f..b77d134 100644
--- a/lib/time.c
+++ b/lib/time.c
@@ -109,6 +109,8 @@  static uint64_t notrace tick_to_time(uint64_t tick)
 
 int __weak timer_init(void)
 {
+	gd->timebase_l = 0;
+	gd->timebase_h = 0;
 	return 0;
 }