diff mbox series

[U-Boot] Fix unreliable detection of DRAM size on Orange Pi 3

Message ID 20190728233942.9767-1-megous@megous.com
State Changes Requested
Delegated to: Jagannadha Sutradharudu Teki
Headers show
Series [U-Boot] Fix unreliable detection of DRAM size on Orange Pi 3 | expand

Commit Message

Ondřej Jirman July 28, 2019, 11:39 p.m. UTC
From: Ondrej Jirman <megous@megous.com>

Orange Pi 3 has 2 GiB of DRAM, that sometime get misdetected
as 4 GiB, due to false negative result from mctl_mem_matches()
when detecting number of column address bits. This leads to
u-boot detecting more address bits than there are and the
boot process hangs shortly after.

In mctl_mem_matches() we need to wait for each write to finish,
separately. Without this, the check is not reliable for some
unknown reason, probably having to do with unpredictable memory
access ordering.

Patch was made with help from André Przywara, who noticed that
my original idea about detection failing due to read-back from
cache without involving DRAM was false, because data cache is
still of at the time of the DRAM size autodetection.

Signed-off-by: Ondrej Jirman <megous@megous.com>
Cc: André Przywara <andre.przywara@arm.com>
---
 arch/arm/mach-sunxi/dram_helpers.c | 1 +
 1 file changed, 1 insertion(+)

Comments

Ondřej Jirman Aug. 24, 2019, 8:07 p.m. UTC | #1
Hi Jagan,

can you please apply this patch to the sunxi tree, so that it
doesn't get lost.

thank you,
	Ondrej

On Mon, Jul 29, 2019 at 01:39:42AM +0200, megous hlavni wrote:
> From: Ondrej Jirman <megous@megous.com>
> 
> Orange Pi 3 has 2 GiB of DRAM, that sometime get misdetected
> as 4 GiB, due to false negative result from mctl_mem_matches()
> when detecting number of column address bits. This leads to
> u-boot detecting more address bits than there are and the
> boot process hangs shortly after.
> 
> In mctl_mem_matches() we need to wait for each write to finish,
> separately. Without this, the check is not reliable for some
> unknown reason, probably having to do with unpredictable memory
> access ordering.
> 
> Patch was made with help from André Przywara, who noticed that
> my original idea about detection failing due to read-back from
> cache without involving DRAM was false, because data cache is
> still of at the time of the DRAM size autodetection.
> 
> Signed-off-by: Ondrej Jirman <megous@megous.com>
> Cc: André Przywara <andre.przywara@arm.com>
> ---
>  arch/arm/mach-sunxi/dram_helpers.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/arch/arm/mach-sunxi/dram_helpers.c b/arch/arm/mach-sunxi/dram_helpers.c
> index 239ab421a8..6dba448638 100644
> --- a/arch/arm/mach-sunxi/dram_helpers.c
> +++ b/arch/arm/mach-sunxi/dram_helpers.c
> @@ -30,6 +30,7 @@ bool mctl_mem_matches(u32 offset)
>  {
>  	/* Try to write different values to RAM at two addresses */
>  	writel(0, CONFIG_SYS_SDRAM_BASE);
> +	dsb();
>  	writel(0xaa55aa55, (ulong)CONFIG_SYS_SDRAM_BASE + offset);
>  	dsb();
>  	/* Check if the same value is actually observed when reading back */
> -- 
> 2.22.0
>
Siarhei Siamashka Aug. 25, 2019, 2:41 p.m. UTC | #2
On Sat, 24 Aug 2019 22:07:43 +0200
Ondřej Jirman <megous@megous.com> wrote:

> Hi Jagan,
> 
> can you please apply this patch to the sunxi tree, so that it
> doesn't get lost.
> 
> thank you,
> 	Ondrej
> 
> On Mon, Jul 29, 2019 at 01:39:42AM +0200, megous hlavni wrote:
> > From: Ondrej Jirman <megous@megous.com>
> > 
> > Orange Pi 3 has 2 GiB of DRAM, that sometime get misdetected
> > as 4 GiB, due to false negative result from mctl_mem_matches()
> > when detecting number of column address bits. This leads to
> > u-boot detecting more address bits than there are and the
> > boot process hangs shortly after.
> > 
> > In mctl_mem_matches() we need to wait for each write to finish,
> > separately. Without this, the check is not reliable for some
> > unknown reason, probably having to do with unpredictable memory
> > access ordering.
> > 
> > Patch was made with help from André Przywara, who noticed that
> > my original idea about detection failing due to read-back from
> > cache without involving DRAM was false, because data cache is
> > still of at the time of the DRAM size autodetection.
> > 
> > Signed-off-by: Ondrej Jirman <megous@megous.com>
> > Cc: André Przywara <andre.przywara@arm.com>
> > ---
> >  arch/arm/mach-sunxi/dram_helpers.c | 1 +
> >  1 file changed, 1 insertion(+)
> > 
> > diff --git a/arch/arm/mach-sunxi/dram_helpers.c b/arch/arm/mach-sunxi/dram_helpers.c
> > index 239ab421a8..6dba448638 100644
> > --- a/arch/arm/mach-sunxi/dram_helpers.c
> > +++ b/arch/arm/mach-sunxi/dram_helpers.c
> > @@ -30,6 +30,7 @@ bool mctl_mem_matches(u32 offset)
> >  {
> >  	/* Try to write different values to RAM at two addresses */
> >  	writel(0, CONFIG_SYS_SDRAM_BASE);
> > +	dsb();
> >  	writel(0xaa55aa55, (ulong)CONFIG_SYS_SDRAM_BASE + offset);
> >  	dsb();
> >  	/* Check if the same value is actually observed when reading back */
> > -- 

Hi!

This looks like resurfacing of the old problem, which did not get a
complete fix back in 2016:
   https://lists.denx.de/pipermail/u-boot/2016-April/251803.html

Now the main question is whether the DSB instruction's barrier
magic is really required here or maybe it's just a timing issue.

Could you please experiment with this problem a little bit more?

 1. What if you just move this second DSB instruction after the
    second write and have two of them there? Does this also make
    the problem disappear?

 2. What if you just replace DSB with udelay(1) / udelay(10) /
    udelay(100)? Does this make the problem disappear?


I suspect that we may be dealing with some sort of buffering in
the DRAM controller itself and the only reason why DSB helps is
that it introduces some delay because it's a rather slow
instruction.
Ondřej Jirman Aug. 25, 2019, 4:12 p.m. UTC | #3
Hello,

On Sun, Aug 25, 2019 at 05:41:55PM +0300, Siarhei Siamashka wrote:
> On Sat, 24 Aug 2019 22:07:43 +0200
> Ondřej Jirman <megous@megous.com> wrote:
> 
> > Hi Jagan,
> > 
> > can you please apply this patch to the sunxi tree, so that it
> > doesn't get lost.
> > 
> > thank you,
> > 	Ondrej
> > 
> > On Mon, Jul 29, 2019 at 01:39:42AM +0200, megous hlavni wrote:
> > > From: Ondrej Jirman <megous@megous.com>
> > > 
> > > Orange Pi 3 has 2 GiB of DRAM, that sometime get misdetected
> > > as 4 GiB, due to false negative result from mctl_mem_matches()
> > > when detecting number of column address bits. This leads to
> > > u-boot detecting more address bits than there are and the
> > > boot process hangs shortly after.
> > > 
> > > In mctl_mem_matches() we need to wait for each write to finish,
> > > separately. Without this, the check is not reliable for some
> > > unknown reason, probably having to do with unpredictable memory
> > > access ordering.
> > > 
> > > Patch was made with help from André Przywara, who noticed that
> > > my original idea about detection failing due to read-back from
> > > cache without involving DRAM was false, because data cache is
> > > still of at the time of the DRAM size autodetection.
> > > 
> > > Signed-off-by: Ondrej Jirman <megous@megous.com>
> > > Cc: André Przywara <andre.przywara@arm.com>
> > > ---
> > >  arch/arm/mach-sunxi/dram_helpers.c | 1 +
> > >  1 file changed, 1 insertion(+)
> > > 
> > > diff --git a/arch/arm/mach-sunxi/dram_helpers.c b/arch/arm/mach-sunxi/dram_helpers.c
> > > index 239ab421a8..6dba448638 100644
> > > --- a/arch/arm/mach-sunxi/dram_helpers.c
> > > +++ b/arch/arm/mach-sunxi/dram_helpers.c
> > > @@ -30,6 +30,7 @@ bool mctl_mem_matches(u32 offset)
> > >  {
> > >  	/* Try to write different values to RAM at two addresses */
> > >  	writel(0, CONFIG_SYS_SDRAM_BASE);
> > > +	dsb();
> > >  	writel(0xaa55aa55, (ulong)CONFIG_SYS_SDRAM_BASE + offset);
> > >  	dsb();
> > >  	/* Check if the same value is actually observed when reading back */
> > > -- 
> 
> Hi!
> 
> This looks like resurfacing of the old problem, which did not get a
> complete fix back in 2016:
>    https://lists.denx.de/pipermail/u-boot/2016-April/251803.html
> 
> Now the main question is whether the DSB instruction's barrier
> magic is really required here or maybe it's just a timing issue.

That's certainly possible, because without this patch, the problem
appears intermittently, and not always. I also have the same end
result, where mctl_mem_matches returns false when it should not.

> Could you please experiment with this problem a little bit more?
> 
>  1. What if you just move this second DSB instruction after the
>     second write and have two of them there? Does this also make
>     the problem disappear?
> 
>  2. What if you just replace DSB with udelay(1) / udelay(10) /
>     udelay(100)? Does this make the problem disappear?
> 
> 
> I suspect that we may be dealing with some sort of buffering in
> the DRAM controller itself and the only reason why DSB helps is
> that it introduces some delay because it's a rather slow
> instruction.

Thanks for suggestions. If it's just the delay, then whatever delay
dsb introduces seems enough. It'll probably need just a small delay,
because most of the time it works without one.

regards,
	Ondrej

> -- 
> Best regards,
> Siarhei Siamashka
Siarhei Siamashka Aug. 25, 2019, 11:36 p.m. UTC | #4
On Sun, 25 Aug 2019 18:12:22 +0200
Ondřej Jirman <megous@megous.com> wrote:

> Hello,
> 
> On Sun, Aug 25, 2019 at 05:41:55PM +0300, Siarhei Siamashka wrote:
> > On Sat, 24 Aug 2019 22:07:43 +0200
> > Ondřej Jirman <megous@megous.com> wrote:
> > 
> > > Hi Jagan,
> > > 
> > > can you please apply this patch to the sunxi tree, so that it
> > > doesn't get lost.
> > > 
> > > thank you,
> > > 	Ondrej
> > > 
> > > On Mon, Jul 29, 2019 at 01:39:42AM +0200, megous hlavni wrote:
> > > > From: Ondrej Jirman <megous@megous.com>
> > > > 
> > > > Orange Pi 3 has 2 GiB of DRAM, that sometime get misdetected
> > > > as 4 GiB, due to false negative result from mctl_mem_matches()
> > > > when detecting number of column address bits. This leads to
> > > > u-boot detecting more address bits than there are and the
> > > > boot process hangs shortly after.
> > > > 
> > > > In mctl_mem_matches() we need to wait for each write to finish,
> > > > separately. Without this, the check is not reliable for some
> > > > unknown reason, probably having to do with unpredictable memory
> > > > access ordering.
> > > > 
> > > > Patch was made with help from André Przywara, who noticed that
> > > > my original idea about detection failing due to read-back from
> > > > cache without involving DRAM was false, because data cache is
> > > > still of at the time of the DRAM size autodetection.
> > > > 
> > > > Signed-off-by: Ondrej Jirman <megous@megous.com>
> > > > Cc: André Przywara <andre.przywara@arm.com>
> > > > ---
> > > >  arch/arm/mach-sunxi/dram_helpers.c | 1 +
> > > >  1 file changed, 1 insertion(+)
> > > > 
> > > > diff --git a/arch/arm/mach-sunxi/dram_helpers.c b/arch/arm/mach-sunxi/dram_helpers.c
> > > > index 239ab421a8..6dba448638 100644
> > > > --- a/arch/arm/mach-sunxi/dram_helpers.c
> > > > +++ b/arch/arm/mach-sunxi/dram_helpers.c
> > > > @@ -30,6 +30,7 @@ bool mctl_mem_matches(u32 offset)
> > > >  {
> > > >  	/* Try to write different values to RAM at two addresses */
> > > >  	writel(0, CONFIG_SYS_SDRAM_BASE);
> > > > +	dsb();
> > > >  	writel(0xaa55aa55, (ulong)CONFIG_SYS_SDRAM_BASE + offset);
> > > >  	dsb();
> > > >  	/* Check if the same value is actually observed when reading back */
> > > > -- 
> > 
> > Hi!
> > 
> > This looks like resurfacing of the old problem, which did not get a
> > complete fix back in 2016:
> >    https://lists.denx.de/pipermail/u-boot/2016-April/251803.html
> > 
> > Now the main question is whether the DSB instruction's barrier
> > magic is really required here or maybe it's just a timing issue.
> 
> That's certainly possible, because without this patch, the problem
> appears intermittently, and not always. I also have the same end
> result, where mctl_mem_matches returns false when it should not.

You could also try to see if the problem becomes easier or harder
to reproduce if you change the CPU or DRAM clock speed.
 
> > Could you please experiment with this problem a little bit more?
> > 
> >  1. What if you just move this second DSB instruction after the
> >     second write and have two of them there? Does this also make
> >     the problem disappear?
> > 
> >  2. What if you just replace DSB with udelay(1) / udelay(10) /
> >     udelay(100)? Does this make the problem disappear?
> > 
> > 
> > I suspect that we may be dealing with some sort of buffering in
> > the DRAM controller itself and the only reason why DSB helps is
> > that it introduces some delay because it's a rather slow
> > instruction.
> 
> Thanks for suggestions. If it's just the delay, then whatever delay
> dsb introduces seems enough. It'll probably need just a small delay,
> because most of the time it works without one.

It still would be better to estimate how long is the delay introduced
by a single DSB instruction. If it's small, then going from one DSB
instructions to two DSB instructions might be not enough to completely
eliminate the problem. Or it might fully fix the problem on your
Orange Pi 3 board, but then show up again on some other SoC with a
different CPU clock speed.

I agree that we need to ensure that the two write instructions
do not get reordered relative to each other. And also ensure
that the second write instruction and read and not reordered
relative to each other either. So the idea of your patch is
good.

My only concern is the choice of DSB instructions, because I
suspect that these memory requests coming from the CPU might
be already properly ordered with or without the use of DSB.
If it's the DRAM controller doing all the buffering or requests
reordering outside of the CPU, then we may still have a race.

Uncached DRAM read/write operations usually need time in a
ballpark of 100 ns. Inserting 1us delays between memory
accesses should be long enough to ensure their completion.

Anyway, it's up to you whether to do some additional tests
or not, in the case if you are curious about what's going on.

If your patch is just applied as-is, then the problem will be
resolved for now. And the worst thing that may happen, would
be somebody encountering the same problem again in the future
and coming here to ask "Hey, what is the right place to
insert a third DSB instruction?" ;-)
Andre Przywara Aug. 30, 2019, 12:44 a.m. UTC | #5
On 25/08/2019 15:41, Siarhei Siamashka wrote:
> On Sat, 24 Aug 2019 22:07:43 +0200
> Ondřej Jirman <megous@megous.com> wrote:

Hi,

>> Hi Jagan,
>>
>> can you please apply this patch to the sunxi tree, so that it
>> doesn't get lost.
>>
>> thank you,
>> 	Ondrej
>>
>> On Mon, Jul 29, 2019 at 01:39:42AM +0200, megous hlavni wrote:
>>> From: Ondrej Jirman <megous@megous.com>
>>>
>>> Orange Pi 3 has 2 GiB of DRAM, that sometime get misdetected
>>> as 4 GiB, due to false negative result from mctl_mem_matches()
>>> when detecting number of column address bits. This leads to
>>> u-boot detecting more address bits than there are and the
>>> boot process hangs shortly after.
>>>
>>> In mctl_mem_matches() we need to wait for each write to finish,
>>> separately. Without this, the check is not reliable for some
>>> unknown reason, probably having to do with unpredictable memory
>>> access ordering.
>>>
>>> Patch was made with help from André Przywara, who noticed that
>>> my original idea about detection failing due to read-back from
>>> cache without involving DRAM was false, because data cache is
>>> still of at the time of the DRAM size autodetection.
>>>
>>> Signed-off-by: Ondrej Jirman <megous@megous.com>
>>> Cc: André Przywara <andre.przywara@arm.com>
>>> ---
>>>  arch/arm/mach-sunxi/dram_helpers.c | 1 +
>>>  1 file changed, 1 insertion(+)
>>>
>>> diff --git a/arch/arm/mach-sunxi/dram_helpers.c b/arch/arm/mach-sunxi/dram_helpers.c
>>> index 239ab421a8..6dba448638 100644
>>> --- a/arch/arm/mach-sunxi/dram_helpers.c
>>> +++ b/arch/arm/mach-sunxi/dram_helpers.c
>>> @@ -30,6 +30,7 @@ bool mctl_mem_matches(u32 offset)
>>>  {
>>>  	/* Try to write different values to RAM at two addresses */
>>>  	writel(0, CONFIG_SYS_SDRAM_BASE);
>>> +	dsb();
>>>  	writel(0xaa55aa55, (ulong)CONFIG_SYS_SDRAM_BASE + offset);
>>>  	dsb();
>>>  	/* Check if the same value is actually observed when reading back */
>>> -- 
> 
> Hi!
> 
> This looks like resurfacing of the old problem, which did not get a
> complete fix back in 2016:
>    https://lists.denx.de/pipermail/u-boot/2016-April/251803.html
> 
> Now the main question is whether the DSB instruction's barrier
> magic is really required here or maybe it's just a timing issue.

I think both:
From the compiler's and the CPU's point of view these two stores look
independent: they don't have any data dependency and go to different
addresses. So both the compiler and the CPU are allowed to reorder these
stores. However, we are after some non-obvious aliasing effect, so in
fact there *is* a dependency between the two stores. So architecturally
this barrier is definitely required, to notify everyone about this
dependency (although I think a dmb(); might be sufficient here).

So for the sake of this patch, we should take it. It is needed and
apparently solves one issue, and I can't see any harm in it.

Regardless of this I was wondering if in this situation (single in-order
core running with MMU and caches off) this is the full story, as the CPU
does not have a good reason to reorder.

> Could you please experiment with this problem a little bit more?
> 
>  1. What if you just move this second DSB instruction after the
>     second write and have two of them there? Does this also make
>     the problem disappear?

I don't think two dsb()s next to each other will do anything useful. I
would think the second dsb() would just be ignored, as the order is
already preserved and all memory accesses have been completed:
"A DSB is a memory barrier that ensures that memory accesses that occur
before the DSB have completed before the completion of the DSB
instruction." (ARMv8 ARM, section B2.3.5: DSB)
I can't imagine this would introduce any kind of relevant delay, as a
DSB is just a barrier.

>  2. What if you just replace DSB with udelay(1) / udelay(10) /
>     udelay(100)? Does this make the problem disappear?>
> 
> I suspect that we may be dealing with some sort of buffering in
> the DRAM controller itself

Indeed it seems perfectly possible that the DDR3 controller buffers and
potentially reorders those two writes.
The canonical solution would to force the write by telling the memory
controller directly, however this is probably not only controller
specific, but also undocumented.

I was wondering whether waiting for the next refresh would help (or for
the period of one refresh cycle), but I guess the DRAM controller could
buffer requests beyond that point.

> and the only reason why DSB helps is
> that it introduces some delay because it's a rather slow
> instruction.

How much delay a DSB instruction introduces, is dependent on many
things, but mostly on the number of outstanding memory requests and the
time it takes them to complete them. So I am not sure we can call it a
"slow instruction" per se.

So to summarise: I think Siarhei has a point in that this is actually
more of a timing issue, but the DSB in there is needed and we should
apply this patch:

Reviewed-by: Andre Przywara <andre.przywara@arm.com>

Cheers,
Andre.
Ondřej Jirman Aug. 30, 2019, 10:56 a.m. UTC | #6
Hello,

On Fri, Aug 30, 2019 at 01:44:34AM +0100, André Przywara wrote:
> On 25/08/2019 15:41, Siarhei Siamashka wrote:
> > On Sat, 24 Aug 2019 22:07:43 +0200
> > Ondřej Jirman <megous@megous.com> wrote:
> 
> Hi,
> 
> >> Hi Jagan,
> >>
> >> can you please apply this patch to the sunxi tree, so that it
> >> doesn't get lost.
> >>
> >> thank you,
> >> 	Ondrej
> >>
> >> On Mon, Jul 29, 2019 at 01:39:42AM +0200, megous hlavni wrote:
> >>> From: Ondrej Jirman <megous@megous.com>
> >>>
> >>> Orange Pi 3 has 2 GiB of DRAM, that sometime get misdetected
> >>> as 4 GiB, due to false negative result from mctl_mem_matches()
> >>> when detecting number of column address bits. This leads to
> >>> u-boot detecting more address bits than there are and the
> >>> boot process hangs shortly after.
> >>>
> >>> In mctl_mem_matches() we need to wait for each write to finish,
> >>> separately. Without this, the check is not reliable for some
> >>> unknown reason, probably having to do with unpredictable memory
> >>> access ordering.
> >>>
> >>> Patch was made with help from André Przywara, who noticed that
> >>> my original idea about detection failing due to read-back from
> >>> cache without involving DRAM was false, because data cache is
> >>> still of at the time of the DRAM size autodetection.
> >>>
> >>> Signed-off-by: Ondrej Jirman <megous@megous.com>
> >>> Cc: André Przywara <andre.przywara@arm.com>
> >>> ---
> >>>  arch/arm/mach-sunxi/dram_helpers.c | 1 +
> >>>  1 file changed, 1 insertion(+)
> >>>
> >>> diff --git a/arch/arm/mach-sunxi/dram_helpers.c b/arch/arm/mach-sunxi/dram_helpers.c
> >>> index 239ab421a8..6dba448638 100644
> >>> --- a/arch/arm/mach-sunxi/dram_helpers.c
> >>> +++ b/arch/arm/mach-sunxi/dram_helpers.c
> >>> @@ -30,6 +30,7 @@ bool mctl_mem_matches(u32 offset)
> >>>  {
> >>>  	/* Try to write different values to RAM at two addresses */
> >>>  	writel(0, CONFIG_SYS_SDRAM_BASE);
> >>> +	dsb();
> >>>  	writel(0xaa55aa55, (ulong)CONFIG_SYS_SDRAM_BASE + offset);
> >>>  	dsb();
> >>>  	/* Check if the same value is actually observed when reading back */
> >>> -- 
> > 
> > Hi!
> > 
> > This looks like resurfacing of the old problem, which did not get a
> > complete fix back in 2016:
> >    https://lists.denx.de/pipermail/u-boot/2016-April/251803.html
> > 
> > Now the main question is whether the DSB instruction's barrier
> > magic is really required here or maybe it's just a timing issue.
> 
> I think both:
> From the compiler's and the CPU's point of view these two stores look
> independent: they don't have any data dependency and go to different
> addresses. So both the compiler and the CPU are allowed to reorder these
> stores. However, we are after some non-obvious aliasing effect, so in
> fact there *is* a dependency between the two stores. So architecturally
> this barrier is definitely required, to notify everyone about this
> dependency (although I think a dmb(); might be sufficient here).

OTOH, let's visualize it:

CPU address                             DRAM address
---------------------------------------------------------------------------
[w 00000000] SDRAM_BASE                 0
[w aa55aa55] SDRAM_BASE + offset        0

Both writes should end up at the same physical location in dram).

The following test checks that values that are read back are the same. They
should be the same in this specific case of Opi3 SBC. It doesn't check for any
specific value, so even if writes are re-ordered, the value that's read back
should still be the same.

Order of operations is:

  writel(0, CONFIG_SYS_SDRAM_BASE);
  writel(0xaa55aa55, (ulong)CONFIG_SYS_SDRAM_BASE + offset);
  dsb();
  readl(CONFIG_SYS_SDRAM_BASE)
  readl((ulong)CONFIG_SYS_SDRAM_BASE + offset);

The CPU can only re-order either the writes or reads. But neither should matter.

Memory controller is a black box to me and that's where the issue probably
lies.

The only time this should fail and return different values when reading is if
the hardware does some shortcut/assumption, and reads will not go fully from
DRAM after the writes.

Anyway, I'm failing to reproduce this even without any patches, ATM.

regards,
	o.

> So for the sake of this patch, we should take it. It is needed and
> apparently solves one issue, and I can't see any harm in it.
> 
> Regardless of this I was wondering if in this situation (single in-order
> core running with MMU and caches off) this is the full story, as the CPU
> does not have a good reason to reorder.
> 
> > Could you please experiment with this problem a little bit more?
> > 
> >  1. What if you just move this second DSB instruction after the
> >     second write and have two of them there? Does this also make
> >     the problem disappear?
> 
> I don't think two dsb()s next to each other will do anything useful. I
> would think the second dsb() would just be ignored, as the order is
> already preserved and all memory accesses have been completed:
> "A DSB is a memory barrier that ensures that memory accesses that occur
> before the DSB have completed before the completion of the DSB
> instruction." (ARMv8 ARM, section B2.3.5: DSB)
> I can't imagine this would introduce any kind of relevant delay, as a
> DSB is just a barrier.
> 
> >  2. What if you just replace DSB with udelay(1) / udelay(10) /
> >     udelay(100)? Does this make the problem disappear?>
> > 
> > I suspect that we may be dealing with some sort of buffering in
> > the DRAM controller itself
> 
> Indeed it seems perfectly possible that the DDR3 controller buffers and
> potentially reorders those two writes.
> The canonical solution would to force the write by telling the memory
> controller directly, however this is probably not only controller
> specific, but also undocumented.
> 
> I was wondering whether waiting for the next refresh would help (or for
> the period of one refresh cycle), but I guess the DRAM controller could
> buffer requests beyond that point.
> 
> > and the only reason why DSB helps is
> > that it introduces some delay because it's a rather slow
> > instruction.
> 
> How much delay a DSB instruction introduces, is dependent on many
> things, but mostly on the number of outstanding memory requests and the
> time it takes them to complete them. So I am not sure we can call it a
> "slow instruction" per se.
> 
> So to summarise: I think Siarhei has a point in that this is actually
> more of a timing issue, but the DSB in there is needed and we should
> apply this patch:
> 
> Reviewed-by: Andre Przywara <andre.przywara@arm.com>
> 
> Cheers,
> Andre.
>
Andre Przywara Aug. 30, 2019, 4:19 p.m. UTC | #7
On Fri, 30 Aug 2019 12:56:28 +0200
Ondřej Jirman <megous@megous.com> wrote:

Hi Ondřej,

> Hello,
> 
> On Fri, Aug 30, 2019 at 01:44:34AM +0100, André Przywara wrote:
> > On 25/08/2019 15:41, Siarhei Siamashka wrote:
> > > On Sat, 24 Aug 2019 22:07:43 +0200
> > > Ondřej Jirman <megous@megous.com> wrote:
> > 
> > Hi,
> > 
> > >> Hi Jagan,
> > >>
> > >> can you please apply this patch to the sunxi tree, so that it
> > >> doesn't get lost.
> > >>
> > >> thank you,
> > >> 	Ondrej
> > >>
> > >> On Mon, Jul 29, 2019 at 01:39:42AM +0200, megous hlavni wrote:
> > >>> From: Ondrej Jirman <megous@megous.com>
> > >>>
> > >>> Orange Pi 3 has 2 GiB of DRAM, that sometime get misdetected
> > >>> as 4 GiB, due to false negative result from mctl_mem_matches()
> > >>> when detecting number of column address bits. This leads to
> > >>> u-boot detecting more address bits than there are and the
> > >>> boot process hangs shortly after.
> > >>>
> > >>> In mctl_mem_matches() we need to wait for each write to finish,
> > >>> separately. Without this, the check is not reliable for some
> > >>> unknown reason, probably having to do with unpredictable memory
> > >>> access ordering.
> > >>>
> > >>> Patch was made with help from André Przywara, who noticed that
> > >>> my original idea about detection failing due to read-back from
> > >>> cache without involving DRAM was false, because data cache is
> > >>> still of at the time of the DRAM size autodetection.
> > >>>
> > >>> Signed-off-by: Ondrej Jirman <megous@megous.com>
> > >>> Cc: André Przywara <andre.przywara@arm.com>
> > >>> ---
> > >>>  arch/arm/mach-sunxi/dram_helpers.c | 1 +
> > >>>  1 file changed, 1 insertion(+)
> > >>>
> > >>> diff --git a/arch/arm/mach-sunxi/dram_helpers.c b/arch/arm/mach-sunxi/dram_helpers.c
> > >>> index 239ab421a8..6dba448638 100644
> > >>> --- a/arch/arm/mach-sunxi/dram_helpers.c
> > >>> +++ b/arch/arm/mach-sunxi/dram_helpers.c
> > >>> @@ -30,6 +30,7 @@ bool mctl_mem_matches(u32 offset)
> > >>>  {
> > >>>  	/* Try to write different values to RAM at two addresses */
> > >>>  	writel(0, CONFIG_SYS_SDRAM_BASE);
> > >>> +	dsb();
> > >>>  	writel(0xaa55aa55, (ulong)CONFIG_SYS_SDRAM_BASE + offset);
> > >>>  	dsb();
> > >>>  	/* Check if the same value is actually observed when reading back */
> > >>> -- 
> > > 
> > > Hi!
> > > 
> > > This looks like resurfacing of the old problem, which did not get a
> > > complete fix back in 2016:
> > >    https://lists.denx.de/pipermail/u-boot/2016-April/251803.html
> > > 
> > > Now the main question is whether the DSB instruction's barrier
> > > magic is really required here or maybe it's just a timing issue.
> > 
> > I think both:
> > From the compiler's and the CPU's point of view these two stores look
> > independent: they don't have any data dependency and go to different
> > addresses. So both the compiler and the CPU are allowed to reorder these
> > stores. However, we are after some non-obvious aliasing effect, so in
> > fact there *is* a dependency between the two stores. So architecturally
> > this barrier is definitely required, to notify everyone about this
> > dependency (although I think a dmb(); might be sufficient here).
> 
> OTOH, let's visualize it:
> 
> CPU address                             DRAM address
> ---------------------------------------------------------------------------
> [w 00000000] SDRAM_BASE                 0
> [w aa55aa55] SDRAM_BASE + offset        0
> 
> Both writes should end up at the same physical location in dram).
> 
> The following test checks that values that are read back are the same. They
> should be the same in this specific case of Opi3 SBC. It doesn't check for any
> specific value, so even if writes are re-ordered, the value that's read back
> should still be the same.

Ah, right. I was actually staring at my modified code, where I try to detect half DQ problems with some weird 3GB setup on the Eachlink H6 Mini. So I explicitly check for and compare to the second value written, and expect this write to have overwritten the first 0 write.

> Order of operations is:
> 
>   writel(0, CONFIG_SYS_SDRAM_BASE);
>   writel(0xaa55aa55, (ulong)CONFIG_SYS_SDRAM_BASE + offset);
>   dsb();

From all I can see this means at this point the two writes are "on the bus" ("completed" from the CPU's point of view).

>   readl(CONFIG_SYS_SDRAM_BASE)

This should actually push the first write to the device, as a read following a write to the same address requires the write to take effect first. Might still end up in a write buffer in the DRAM controller, though.

>   readl((ulong)CONFIG_SYS_SDRAM_BASE + offset);
> 
> The CPU can only re-order either the writes or reads. But neither should matter.

Yes, true, for the existing code (which compares the two reads against each other).

> Memory controller is a black box to me and that's where the issue probably
> lies.

Yes, the plot thickens. Maybe the DRAM controller buffers the write request(s), as it waits for the page(s) to open in the chips, for instance. Now the reads can be satisfied from this buffer, since the addresses match. No idea if it would break down the addresses into rows/cols/bank/rank already to detect the aliasing.

> The only time this should fail and return different values when reading is if
> the hardware does some shortcut/assumption, and reads will not go fully from
> DRAM after the writes.

Sounds tempting, but with the caches and the MMU off (so treating everything as device memory) I don't see where this would happen.
The only case I could imagine would be *in* the DRAM controller, as mentioned above.

So in this case we would need to tell the DRAM controller to complete all outstanding requests (the two writes) first. Then issue the reads. As mentioned, sounds possible, but I have no clue how to do this.
A slight hack would indeed be some delay loop, to give the DRAM controller time to fiddle the writes into the DRAM matrix.
Or flood it with writes, then DSB to force this buffer to drain?

> Anyway, I'm failing to reproduce this even without any patches, ATM.

Mmmh, shame.
Any changes in your setup you can think of? New power supply?
Maybe we should indeed experiment with lower clocks? Did you run stability tests in Linux? To prove that the DRAM timing is actually solid?

Cheers,
Andre.

> regards,
> 	o.
> 
> > So for the sake of this patch, we should take it. It is needed and
> > apparently solves one issue, and I can't see any harm in it.
> > 
> > Regardless of this I was wondering if in this situation (single in-order
> > core running with MMU and caches off) this is the full story, as the CPU
> > does not have a good reason to reorder.
> > 
> > > Could you please experiment with this problem a little bit more?
> > > 
> > >  1. What if you just move this second DSB instruction after the
> > >     second write and have two of them there? Does this also make
> > >     the problem disappear?
> > 
> > I don't think two dsb()s next to each other will do anything useful. I
> > would think the second dsb() would just be ignored, as the order is
> > already preserved and all memory accesses have been completed:
> > "A DSB is a memory barrier that ensures that memory accesses that occur
> > before the DSB have completed before the completion of the DSB
> > instruction." (ARMv8 ARM, section B2.3.5: DSB)
> > I can't imagine this would introduce any kind of relevant delay, as a
> > DSB is just a barrier.
> > 
> > >  2. What if you just replace DSB with udelay(1) / udelay(10) /
> > >     udelay(100)? Does this make the problem disappear?>
> > > 
> > > I suspect that we may be dealing with some sort of buffering in
> > > the DRAM controller itself
> > 
> > Indeed it seems perfectly possible that the DDR3 controller buffers and
> > potentially reorders those two writes.
> > The canonical solution would to force the write by telling the memory
> > controller directly, however this is probably not only controller
> > specific, but also undocumented.
> > 
> > I was wondering whether waiting for the next refresh would help (or for
> > the period of one refresh cycle), but I guess the DRAM controller could
> > buffer requests beyond that point.
> > 
> > > and the only reason why DSB helps is
> > > that it introduces some delay because it's a rather slow
> > > instruction.
> > 
> > How much delay a DSB instruction introduces, is dependent on many
> > things, but mostly on the number of outstanding memory requests and the
> > time it takes them to complete them. So I am not sure we can call it a
> > "slow instruction" per se.
> > 
> > So to summarise: I think Siarhei has a point in that this is actually
> > more of a timing issue, but the DSB in there is needed and we should
> > apply this patch:
> > 
> > Reviewed-by: Andre Przywara <andre.przywara@arm.com>
> > 
> > Cheers,
> > Andre.
> >
Ondřej Jirman Aug. 30, 2019, 8:40 p.m. UTC | #8
Hi André,

On Fri, Aug 30, 2019 at 05:19:02PM +0100, Andre Przywara wrote:
> > OTOH, let's visualize it:
> > 
> > CPU address                             DRAM address
> > ---------------------------------------------------------------------------
> > [w 00000000] SDRAM_BASE                 0
> > [w aa55aa55] SDRAM_BASE + offset        0
> > 
> > Both writes should end up at the same physical location in dram).
> > 
> > The following test checks that values that are read back are the same. They
> > should be the same in this specific case of Opi3 SBC. It doesn't check for any
> > specific value, so even if writes are re-ordered, the value that's read back
> > should still be the same.
> 
> Ah, right. I was actually staring at my modified code, where I try to detect
> half DQ problems with some weird 3GB setup on the Eachlink H6 Mini. So
> I explicitly check for and compare to the second value written, and expect
> this write to have overwritten the first 0 write.
> 
> > Order of operations is:
> > 
> >   writel(0, CONFIG_SYS_SDRAM_BASE);
> >   writel(0xaa55aa55, (ulong)CONFIG_SYS_SDRAM_BASE + offset);
> >   dsb();
> 
> From all I can see this means at this point the two writes are "on the bus"
> ("completed" from the CPU's point of view).
> 
> >   readl(CONFIG_SYS_SDRAM_BASE)
> 
> This should actually push the first write to the device, as a read following
> a write to the same address requires the write to take effect first. Might
> still end up in a write buffer in the DRAM controller, though.
> 
> >   readl((ulong)CONFIG_SYS_SDRAM_BASE + offset);
> > 
> > The CPU can only re-order either the writes or reads. But neither should matter.
> 
> Yes, true, for the existing code (which compares the two reads against each
> other).
> 
> > Memory controller is a black box to me and that's where the issue probably
> > lies.
> 
> Yes, the plot thickens. Maybe the DRAM controller buffers the write
> request(s), as it waits for the page(s) to open in the chips, for instance.
> Now the reads can be satisfied from this buffer, since the addresses match. No
> idea if it would break down the addresses into rows/cols/bank/rank already to
> detect the aliasing.
> 
> > The only time this should fail and return different values when reading is if
> > the hardware does some shortcut/assumption, and reads will not go fully from
> > DRAM after the writes.
> 
> Sounds tempting, but with the caches and the MMU off (so treating everything
> as device memory) I don't see where this would happen. The only case I could
> imagine would be *in* the DRAM controller, as mentioned above.

Yes, that's what I meant.

> So in this case we would need to tell the DRAM controller to complete all
> outstanding requests (the two writes) first. Then issue the reads. As
> mentioned, sounds possible, but I have no clue how to do this. A slight hack
> would indeed be some delay loop, to give the DRAM controller time to fiddle
> the writes into the DRAM matrix.
> Or flood it with writes, then DSB to force this buffer to drain?

That's antoher interesting idea. :) Though we don't know the size of the buffer.

> > Anyway, I'm failing to reproduce this even without any patches, ATM.
> 
> Mmmh, shame.
> Any changes in your setup you can think of? New power supply?
> Maybe we should indeed experiment with lower clocks? Did you run stability
> tests in Linux? To prove that the DRAM timing is actually solid?

I thought it might be temperature related, because now I'm using fan to cool
the board. But that turned out to be not it (I pre-heated the board, and
nothing).

No stability tests, but I've been compiling mesa repeatedly on the board
without a single obvious issue, I also push a lot of data through the board
to the USB connected SSD.

The kernel didn't panic unexpectedly yet. (serveral months)

I suspect that the DRAM timing is good.

I'm trying memterster now, and it looks good, too.

regards,
	o.

> Cheers,
> Andre.
> 
> > regards,
> > 	o.
> > 
> > > So for the sake of this patch, we should take it. It is needed and
> > > apparently solves one issue, and I can't see any harm in it.
> > > 
> > > Regardless of this I was wondering if in this situation (single in-order
> > > core running with MMU and caches off) this is the full story, as the CPU
> > > does not have a good reason to reorder.
> > > 
> > > > Could you please experiment with this problem a little bit more?
> > > > 
> > > >  1. What if you just move this second DSB instruction after the
> > > >     second write and have two of them there? Does this also make
> > > >     the problem disappear?
> > > 
> > > I don't think two dsb()s next to each other will do anything useful. I
> > > would think the second dsb() would just be ignored, as the order is
> > > already preserved and all memory accesses have been completed:
> > > "A DSB is a memory barrier that ensures that memory accesses that occur
> > > before the DSB have completed before the completion of the DSB
> > > instruction." (ARMv8 ARM, section B2.3.5: DSB)
> > > I can't imagine this would introduce any kind of relevant delay, as a
> > > DSB is just a barrier.
> > > 
> > > >  2. What if you just replace DSB with udelay(1) / udelay(10) /
> > > >     udelay(100)? Does this make the problem disappear?>
> > > > 
> > > > I suspect that we may be dealing with some sort of buffering in
> > > > the DRAM controller itself
> > > 
> > > Indeed it seems perfectly possible that the DDR3 controller buffers and
> > > potentially reorders those two writes.
> > > The canonical solution would to force the write by telling the memory
> > > controller directly, however this is probably not only controller
> > > specific, but also undocumented.
> > > 
> > > I was wondering whether waiting for the next refresh would help (or for
> > > the period of one refresh cycle), but I guess the DRAM controller could
> > > buffer requests beyond that point.
> > > 
> > > > and the only reason why DSB helps is
> > > > that it introduces some delay because it's a rather slow
> > > > instruction.
> > > 
> > > How much delay a DSB instruction introduces, is dependent on many
> > > things, but mostly on the number of outstanding memory requests and the
> > > time it takes them to complete them. So I am not sure we can call it a
> > > "slow instruction" per se.
> > > 
> > > So to summarise: I think Siarhei has a point in that this is actually
> > > more of a timing issue, but the DSB in there is needed and we should
> > > apply this patch:
> > > 
> > > Reviewed-by: Andre Przywara <andre.przywara@arm.com>
> > > 
> > > Cheers,
> > > Andre.
> > > 
>
diff mbox series

Patch

diff --git a/arch/arm/mach-sunxi/dram_helpers.c b/arch/arm/mach-sunxi/dram_helpers.c
index 239ab421a8..6dba448638 100644
--- a/arch/arm/mach-sunxi/dram_helpers.c
+++ b/arch/arm/mach-sunxi/dram_helpers.c
@@ -30,6 +30,7 @@  bool mctl_mem_matches(u32 offset)
 {
 	/* Try to write different values to RAM at two addresses */
 	writel(0, CONFIG_SYS_SDRAM_BASE);
+	dsb();
 	writel(0xaa55aa55, (ulong)CONFIG_SYS_SDRAM_BASE + offset);
 	dsb();
 	/* Check if the same value is actually observed when reading back */