[U-Boot,6/9] CACHE: nand read/write: Test if start address is aligned

Message ID 1340583477-14018-7-git-send-email-marex@denx.de
State Rejected
Delegated to: Marek Vasut

Commit Message

Marek Vasut June 25, 2012, 12:17 a.m. UTC
This prevents the scenario where data cache is on and the
device uses DMA to deploy data. In that case, it might not
be possible to flush/invalidate data to RAM properly. The
other option is to use bounce buffer, but that involves a
lot of copying and therefore degrades performance rapidly.
Therefore disallow this possibility of unaligned load address
altogether if data cache is on.

Signed-off-by: Marek Vasut <marex@denx.de>
Cc: Scott Wood <scottwood@freescale.com>
---
 common/cmd_nand.c |    9 +++++++++
 1 file changed, 9 insertions(+)

Comments

Scott Wood June 25, 2012, 4:58 p.m. UTC | #1
On 06/24/2012 07:17 PM, Marek Vasut wrote:
> This prevents the scenario where data cache is on and the
> device uses DMA to deploy data. In that case, it might not
> be possible to flush/invalidate data to RAM properly. The
> other option is to use bounce buffer,

Or get cache coherent hardware. :-)

> but that involves a lot of copying and therefore degrades performance
> rapidly. Therefore disallow this possibility of unaligned load
> address altogether if data cache is on.

How about use the bounce buffer only if the address is misaligned?  The
corrective action a user has to take is the same as with this patch,
except for an additional option of living with the slight performance
penalty.  How often does this actually happen?  How much does it
actually slow things down compared to the speed of the NAND chip?

I'm hesitant to break something -- even if it's odd (literally in this
case) -- that currently works on most hardware, just because one or two
drivers can't handle it.  It feels kind of like changing the read() and
write() system calls to require cacheline alignment. :-P

> Signed-off-by: Marek Vasut <marex@denx.de>
> Cc: Scott Wood <scottwood@freescale.com>
> ---
>  common/cmd_nand.c |    9 +++++++++
>  1 file changed, 9 insertions(+)
> 
> diff --git a/common/cmd_nand.c b/common/cmd_nand.c
> index a91ccf4..122a91c 100644
> --- a/common/cmd_nand.c
> +++ b/common/cmd_nand.c
> @@ -609,6 +609,8 @@ int do_nand(cmd_tbl_t * cmdtp, int flag, int argc, char * const argv[])
>  			goto usage;
>  
>  		addr = (ulong)simple_strtoul(argv[2], NULL, 16);
> +		if (!cacheline_aligned(addr))
> +			return 1;

There's no way you can just return like this without printing an error
that lets the user know that the operation wasn't performed, and why.

-Scott
Tom Rini June 25, 2012, 6:43 p.m. UTC | #2
On Mon, Jun 25, 2012 at 11:58:10AM -0500, Scott Wood wrote:
> On 06/24/2012 07:17 PM, Marek Vasut wrote:
> > This prevents the scenario where data cache is on and the
> > device uses DMA to deploy data. In that case, it might not
> > be possible to flush/invalidate data to RAM properly. The
> > other option is to use bounce buffer,
> 
> Or get cache coherent hardware. :-)
> 
> > but that involves a lot of copying and therefore degrades performance
> > rapidly. Therefore disallow this possibility of unaligned load
> > address altogether if data cache is on.
> 
> How about use the bounce buffer only if the address is misaligned?  The
> corrective action a user has to take is the same as with this patch,
> except for an additional option of living with the slight performance
> penalty.  How often does this actually happen?  How much does it
> actually slow things down compared to the speed of the NAND chip?

We would need to architect things such that any 'load' command would be
routed through this logic.  A solvable problem for sure, but a little
bit larger than just for NAND :)

> I'm hesitant to break something -- even if it's odd (literally in this
> case) -- that currently works on most hardware, just because one or two
> drivers can't handle it.  It feels kind of like changing the read() and
> write() system calls to require cacheline alignment. :-P

I don't want to get into an ARM vs PowerPC argument.  I think the best
answer is that I'm not sure having things unaligned works totally right
today as I did a silly test of loading a uImage to 0x82000001 and bootm
hung inside of U-Boot not long ago.  Can you try that on some cache
coherent hardware and see if that works?

[snip]
> > @@ -609,6 +609,8 @@ int do_nand(cmd_tbl_t * cmdtp, int flag, int argc, char * const argv[])
> >  			goto usage;
> >  
> >  		addr = (ulong)simple_strtoul(argv[2], NULL, 16);
> > +		if (!cacheline_aligned(addr))
> > +			return 1;
> 
> There's no way you can just return like this without printing an error
> that lets the user know that the operation wasn't performed, and why.

Note that cacheline_aligned does the error print (patch 2 of 9).
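
The helper referenced here comes from patch 2/9, which is not included in this
thread. As a rough illustration only, a check along these lines could be built
from U-Boot's ARCH_DMA_MINALIGN and dcache_status(); the name, placement and
exact behaviour of the real helper are assumed, not taken from the series:

    #include <common.h>		/* printf(), dcache_status() */
    #include <asm/cache.h>	/* ARCH_DMA_MINALIGN */

    /*
     * Return 1 if addr is usable for DMA with caches on, 0 (after printing
     * why) otherwise.  Per the commit message, unaligned addresses are only
     * rejected when the data cache is actually enabled.
     */
    static inline int cacheline_aligned(ulong addr)
    {
    	if (!dcache_status())
    		return 1;

    	if (addr & (ARCH_DMA_MINALIGN - 1)) {
    		printf("ERROR: address 0x%08lx is not %d-byte cacheline aligned\n",
    		       addr, ARCH_DMA_MINALIGN);
    		return 0;
    	}

    	return 1;
    }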
Scott Wood June 25, 2012, 8:08 p.m. UTC | #3
On 06/25/2012 01:43 PM, Tom Rini wrote:
> On Mon, Jun 25, 2012 at 11:58:10AM -0500, Scott Wood wrote:
>> On 06/24/2012 07:17 PM, Marek Vasut wrote:
>>> This prevents the scenario where data cache is on and the
>>> device uses DMA to deploy data. In that case, it might not
>>> be possible to flush/invalidate data to RAM properly. The
>>> other option is to use bounce buffer,
>>
>> Or get cache coherent hardware. :-)
>>
>>> but that involves a lot of copying and therefore degrades performance
>>> rapidly. Therefore disallow this possibility of unaligned load
>>> address altogether if data cache is on.
>>
>> How about use the bounce buffer only if the address is misaligned?  The
>> corrective action a user has to take is the same as with this patch,
>> except for an additional option of living with the slight performance
>> penalty.  How often does this actually happen?  How much does it
>> actually slow things down compared to the speed of the NAND chip?
> 
> We would need to architect things such that any 'load' command would be
> routed through this logic.

It's something the driver backend should handle (possibly via a common
helper library).  The fact that you can't do a DMA transfer to an
unaligned buffer is a hardware-specific detail, just as is the fact that
you're setting up a DMA buffer in the first place.

>> I'm hesitant to break something -- even if it's odd (literally in this
>> case) -- that currently works on most hardware, just because one or two
>> drivers can't handle it.  It feels kind of like changing the read() and
>> write() system calls to require cacheline alignment. :-P
> 
> I don't want to get into an ARM vs PowerPC argument.  I think the best
> answer is that I'm not sure having things unaligned works totally right
> today as I did a silly test of loading a uImage to 0x82000001 and bootm
> hung inside of U-Boot not long ago.  Can you try that on some cache
> coherent hardware and see if that works?

I'm not sure what bootm has to do with nand (and the fact that some ppc
is cache coherent actually doesn't matter, since we don't do DMA for
NAND), but I was able to bootm from an odd RAM address, and "nand read"
to an odd RAM address, on p5020ds.

> [snip]
>>> @@ -609,6 +609,8 @@ int do_nand(cmd_tbl_t * cmdtp, int flag, int argc, char * const argv[])
>>>  			goto usage;
>>>  
>>>  		addr = (ulong)simple_strtoul(argv[2], NULL, 16);
>>> +		if (!cacheline_aligned(addr))
>>> +			return 1;
>>
>> There's no way you can just return like this without printing an error
>> that lets the user know that the operation wasn't performed, and why.
> 
> Note that cacheline_aligned does the error print (patch 2 of 9).

Ah.  I saw this patch first since it was CCed to me. :-)

-Scott
Tom Rini June 25, 2012, 8:48 p.m. UTC | #4
On 06/25/2012 01:08 PM, Scott Wood wrote:
> On 06/25/2012 01:43 PM, Tom Rini wrote:
>> On Mon, Jun 25, 2012 at 11:58:10AM -0500, Scott Wood wrote:
>>> On 06/24/2012 07:17 PM, Marek Vasut wrote:
>>>> This prevents the scenario where data cache is on and the 
>>>> device uses DMA to deploy data. In that case, it might not be
>>>> possible to flush/invalidate data to RAM properly. The other
>>>> option is to use bounce buffer,
>>> 
>>> Or get cache coherent hardware. :-)
>>> 
>>>> but that involves a lot of copying and therefore degrades
>>>> performance rapidly. Therefore disallow this possibility of
>>>> unaligned load address altogether if data cache is on.
>>> 
>>> How about use the bounce buffer only if the address is
>>> misaligned?  The corrective action a user has to take is the
>>> same as with this patch, except for an additional option of
>>> living with the slight performance penalty.  How often does
>>> this actually happen?  How much does it actually slow things
>>> down compared to the speed of the NAND chip?
>> 
>> We would need to architect things such that any 'load' command
>> would be routed through this logic.
> 
> It's something the driver backend should handle (possibly via a
> common helper library).  The fact that you can't do a DMA transfer
> to an unaligned buffer is a hardware-specific detail, just as is
> the fact that you're setting up a DMA buffer in the first place.

Right.  What I'm trying to say is it's not a NAND problem it's an
unaligned addresses problem so the solution needs to be easily used
everywhere.

>>> I'm hesitant to break something -- even if it's odd (literally
>>> in this case) -- that currently works on most hardware, just
>>> because one or two drivers can't handle it.  It feels kind of
>>> like changing the read() and write() system calls to require
>>> cacheline alignment. :-P
>> 
>> I don't want to get into an ARM vs PowerPC argument.  I think the
>> best answer is that I'm not sure having things unaligned works
>> totally right today as I did a silly test of loading a uImage to
>> 0x82000001 and bootm hung inside of U-Boot not long ago.  Can you
>> try that on some cache coherent hardware and see if that works?
> 
> I'm not sure what bootm has to do with nand (and the fact that some
> ppc is cache coherent actually doesn't matter, since we don't do
> DMA for NAND), but I was able to bootm from an odd RAM address, and
> "nand read" to an odd RAM address, on p5020ds.

On ARM-land we have a lot of problems with unaligned addresses, even
with cache off.  I went to reproduce the original bootm problem and
ran into fatload hanging.  tftp didn't fail but bootm hangs.

-- 
Tom
Scott Wood June 25, 2012, 9:17 p.m. UTC | #5
On 06/25/2012 03:48 PM, Tom Rini wrote:
> On 06/25/2012 01:08 PM, Scott Wood wrote:
>> On 06/25/2012 01:43 PM, Tom Rini wrote:
>>> On Mon, Jun 25, 2012 at 11:58:10AM -0500, Scott Wood wrote:
>>>> On 06/24/2012 07:17 PM, Marek Vasut wrote:
>>>>> This prevents the scenario where data cache is on and the 
>>>>> device uses DMA to deploy data. In that case, it might not be
>>>>> possible to flush/invalidate data to RAM properly. The other
>>>>> option is to use bounce buffer,
>>>>
>>>> Or get cache coherent hardware. :-)
>>>>
>>>>> but that involves a lot of copying and therefore degrades
>>>>> performance rapidly. Therefore disallow this possibility of
>>>>> unaligned load address altogether if data cache is on.
>>>>
>>>> How about use the bounce buffer only if the address is
>>>> misaligned?  The corrective action a user has to take is the
>>>> same as with this patch, except for an additional option of
>>>> living with the slight performance penalty.  How often does
>>>> this actually happen?  How much does it actually slow things
>>>> down compared to the speed of the NAND chip?
>>>
>>> We would need to architect things such that any 'load' command
>>> would be routed through this logic.
>>
>> It's something the driver backend should handle (possibly via a
>> common helper library).  The fact that you can't do a DMA transfer
>> to an unaligned buffer is a hardware-specific detail, just as is
>> the fact that you're setting up a DMA buffer in the first place.
> 
> Right.  What I'm trying to say is it's not a NAND problem it's an
> unaligned addresses problem so the solution needs to be easily used
> everywhere.

OK, so fix it in each driver that has this issue.  A lot of drivers are
probably not so performance critical that you can't just always use a
bounce buffer.  A static buffer plus memcpy isn't that burdensome --
it's close to what the drivers for non-DMA hardware do.  For higher
performance peripherals, throw in an if-statement or two.  It doesn't
seem like something that needs a U-Boot-wide change.

In the specific case of NAND, how many NAND drivers use DMA at all?

>> I'm not sure what bootm has to do with nand (and the fact that some
>> ppc is cache coherent actually doesn't matter, since we don't do
>> DMA for NAND), but I was able to bootm from an odd RAM address, and
>> "nand read" to an odd RAM address, on p5020ds.
> 
> On ARM-land we have a lot of problems with unaligned addresses, even
> with cache off.  I went to reproduce the original bootm problem and
> ran into fatload hanging.  tftp didn't fail but bootm hangs.

Maybe you can't take alignment exceptions during bootm?  PPC doesn't
normally take alignment checks, but we would have trouble with this
scenario if it did, since bootm clobbers the exception vectors.

-Scott
Tom Rini June 25, 2012, 9:22 p.m. UTC | #6
On 06/25/2012 02:17 PM, Scott Wood wrote:
> On 06/25/2012 03:48 PM, Tom Rini wrote:
>> 
>> On 06/25/2012 01:08 PM, Scott Wood wrote:
>>> On 06/25/2012 01:43 PM, Tom Rini wrote:
>>>> On Mon, Jun 25, 2012 at 11:58:10AM -0500, Scott Wood wrote:
>>>>> On 06/24/2012 07:17 PM, Marek Vasut wrote:
>>>>>> This prevents the scenario where data cache is on and the
>>>>>>  device uses DMA to deploy data. In that case, it might
>>>>>> not be possible to flush/invalidate data to RAM properly.
>>>>>> The other option is to use bounce buffer,
>>>>> 
>>>>> Or get cache coherent hardware. :-)
>>>>> 
>>>>>> but that involves a lot of copying and therefore
>>>>>> degrades performance rapidly. Therefore disallow this
>>>>>> possibility of unaligned load address altogether if data
>>>>>> cache is on.
>>>>> 
>>>>> How about use the bounce buffer only if the address is 
>>>>> misaligned?  The corrective action a user has to take is
>>>>> the same as with this patch, except for an additional
>>>>> option of living with the slight performance penalty.  How
>>>>> often does this actually happen?  How much does it actually
>>>>> slow things down compared to the speed of the NAND chip?
>>>> 
>>>> We would need to architect things such that any 'load'
>>>> command would be routed through this logic.
>>> 
>>> It's something the driver backend should handle (possibly via
>>> a common helper library).  The fact that you can't do a DMA
>>> transfer to an unaligned buffer is a hardware-specific detail,
>>> just as is the fact that you're setting up a DMA buffer in the
>>> first place.
>> 
>> Right.  What I'm trying to say is it's not a NAND problem it's
>> an unaligned addresses problem so the solution needs to be easily
>> used everywhere.
> 
> OK, so fix it in each driver that has this issue.  A lot of drivers
> are probably not so performance critical that you can't just always
> use a bounce buffer.  A static buffer plus memcpy isn't that
> burdensome -- it's close to what the drivers for non-DMA hardware
> do.  For higher performance peripherals, throw in an if-statement
> or two.  It doesn't seem like something that needs a U-Boot-wide
> change.
> 
> In the specific case of NAND, how many NAND drivers use DMA at
> all?

Off hand I'm not sure.  I know there've been patches for OMAP parts
before but they had some unaddressed feedback and aren't in mainline.

>>> I'm not sure what bootm has to do with nand (and the fact that
>>> some ppc is cache coherent actually doesn't matter, since we
>>> don't do DMA for NAND), but I was able to bootm from an odd RAM
>>> address, and "nand read" to an odd RAM address, on p5020ds.
>> 
>> On ARM-land we have a lot of problems with unaligned addresses,
>> even with cache off.  I went to reproduce the original bootm
>> problem and ran into fatload hanging.  tftp didn't fail but bootm
>> hangs.
> 
> Maybe you can't take alignment exceptions during bootm?  PPC
> doesn't normally take alignment checks, but we would have trouble
> with this scenario if it did, since bootm clobbers the exception
> vectors.

Could be it.  But again, fatload via mmc also chokes, so there are a few
quite long-standing problems.  What Marek and I are arguing for, I
believe, is that it's not worth the effort to enable this corner
case.  Since it works on other arches, we shouldn't make this a silent
change that slips in, certainly, but at the end of the day it's highly
unlikely someone is using this in the wild and won't be able to work
around it in their next design or field upgrade.  I would hope so, at least.

-- 
Tom
Marek Vasut June 25, 2012, 11:37 p.m. UTC | #7
Dear Scott Wood,

> On 06/24/2012 07:17 PM, Marek Vasut wrote:
> > This prevents the scenario where data cache is on and the
> > device uses DMA to deploy data. In that case, it might not
> > be possible to flush/invalidate data to RAM properly. The
> > other option is to use bounce buffer,
> 
> Or get cache coherent hardware. :-)

Oh ... you mean powerpc? Or rather something like this 
http://cache.freescale.com/files/32bit/doc/fact_sheet/QORIQLS2FAMILYFS.pdf ? :-D

> > but that involves a lot of copying and therefore degrades performance
> > rapidly. Therefore disallow this possibility of unaligned load
> > address altogether if data cache is on.
> 
> How about use the bounce buffer only if the address is misaligned?

Not happening, bounce buffer is bullshit, especially if we can prevent it by 
teaching user not to do retarded things.

It's like driving a car in the wrong lane. Sure, you can do it, but it'll 
eventually have some consequences. And using a bounce buffer is like driving a 
tank in the wrong lane ...

> The
> corrective action a user has to take is the same as with this patch,
> except for an additional option of living with the slight performance
> penalty.

Slight is very weak word here.

> How often does this actually happen?  How much does it
> actually slow things down compared to the speed of the NAND chip?

If the user is dumb, always. But if you tell the user how to milk the most of 
the hardware, he'll be happier.

> I'm hesitant to break something -- even if it's odd (literally in this
> case) -- that currently works on most hardware, just because one or two
> drivers can't handle it.  It feels kind of like changing the read() and
> write() system calls to require cacheline alignment. :-P

That's actually almost right, we're doing a bootloader here, it might have 
limitations. We're not writing yet another operating system with no bounds on 
possibilities!

> > Signed-off-by: Marek Vasut <marex@denx.de>
> > Cc: Scott Wood <scottwood@freescale.com>
> > ---
> > 
> >  common/cmd_nand.c |    9 +++++++++
> >  1 file changed, 9 insertions(+)
> > 
> > diff --git a/common/cmd_nand.c b/common/cmd_nand.c
> > index a91ccf4..122a91c 100644
> > --- a/common/cmd_nand.c
> > +++ b/common/cmd_nand.c
> > @@ -609,6 +609,8 @@ int do_nand(cmd_tbl_t * cmdtp, int flag, int argc,
> > char * const argv[])
> > 
> >  			goto usage;
> >  		
> >  		addr = (ulong)simple_strtoul(argv[2], NULL, 16);
> > 
> > +		if (!cacheline_aligned(addr))
> > +			return 1;
> 
> There's no way you can just return like this without printing an error
> that lets the user know that the operation wasn't performed, and why.
> 
> -Scott

Best regards,
Marek Vasut
Marek Vasut June 25, 2012, 11:38 p.m. UTC | #8
Dear Tom Rini,

> On Mon, Jun 25, 2012 at 11:58:10AM -0500, Scott Wood wrote:
> > On 06/24/2012 07:17 PM, Marek Vasut wrote:
> > > This prevents the scenario where data cache is on and the
> > > device uses DMA to deploy data. In that case, it might not
> > > be possible to flush/invalidate data to RAM properly. The
> > > other option is to use bounce buffer,
> > 
> > Or get cache coherent hardware. :-)
> > 
> > > but that involves a lot of copying and therefore degrades performance
> > > rapidly. Therefore disallow this possibility of unaligned load
> > > address altogether if data cache is on.
> > 
> > How about use the bounce buffer only if the address is misaligned?  The
> > corrective action a user has to take is the same as with this patch,
> > except for an additional option of living with the slight performance
> > penalty.  How often does this actually happen?  How much does it
> > actually slow things down compared to the speed of the NAND chip?
> 
> We would need to architect things such that any 'load' command would be
> routed through this logic.  A solvable problem for sure, but a little
> bit larger than just for NAND :)

It's not only NAND, it's the whole user interface that's broken right now.

> > I'm hesitant to break something -- even if it's odd (literally in this
> > case) -- that currently works on most hardware, just because one or two
> > drivers can't handle it.  It feels kind of like changing the read() and
> > write() system calls to require cacheline alignment. :-P
> 
> I don't want to get into an ARM vs PowerPC argument.

ARM is better anyway :-D

> I think the best
> answer is that I'm not sure having things unaligned works totally right
> today as I did a silly test of loading a uImage to 0x82000001 and bootm
> hung inside of U-Boot not long ago.  Can you try that on some cache
> coherent hardware and see if that works?

+1

[...]

Best regards,
Marek Vasut
Marek Vasut June 25, 2012, 11:42 p.m. UTC | #9
Dear Scott Wood,

> On 06/25/2012 03:48 PM, Tom Rini wrote:
> > On 06/25/2012 01:08 PM, Scott Wood wrote:
> >> On 06/25/2012 01:43 PM, Tom Rini wrote:
> >>> On Mon, Jun 25, 2012 at 11:58:10AM -0500, Scott Wood wrote:
> >>>> On 06/24/2012 07:17 PM, Marek Vasut wrote:
> >>>>> This prevents the scenario where data cache is on and the
> >>>>> device uses DMA to deploy data. In that case, it might not be
> >>>>> possible to flush/invalidate data to RAM properly. The other
> >>>>> option is to use bounce buffer,
> >>>> 
> >>>> Or get cache coherent hardware. :-)
> >>>> 
> >>>>> but that involves a lot of copying and therefore degrades
> >>>>> performance rapidly. Therefore disallow this possibility of
> >>>>> unaligned load address altogether if data cache is on.
> >>>> 
> >>>> How about use the bounce buffer only if the address is
> >>>> misaligned?  The corrective action a user has to take is the
> >>>> same as with this patch, except for an additional option of
> >>>> living with the slight performance penalty.  How often does
> >>>> this actually happen?  How much does it actually slow things
> >>>> down compared to the speed of the NAND chip?
> >>> 
> >>> We would need to architect things such that any 'load' command
> >>> would be routed through this logic.
> >> 
> >> It's something the driver backend should handle (possibly via a
> >> common helper library).  The fact that you can't do a DMA transfer
> >> to an unaligned buffer is a hardware-specific detail, just as is
> >> the fact that you're setting up a DMA buffer in the first place.
> > 
> > Right.  What I'm trying to say is it's not a NAND problem it's an
> > unaligned addresses problem so the solution needs to be easily used
> > everywhere.
> 
> OK, so fix it in each driver that has this issue.  A lot of drivers are
> probably not so performance critical that you can't just always use a
> bounce buffer.  A static buffer plus memcpy isn't that burdensome --
> it's close to what the drivers for non-DMA hardware do.  For higher
> performance peripherals, throw in an if-statement or two.  It doesn't
> seem like something that needs a U-Boot-wide change.

This is flat bull. I don't want bounce buffers growing all around uboot, see my 
previous email. I'm 120% firm in that.

And btw it's not about bounce buffers, it's also about other code (like FS code) 
which does unaligned accesses and we're fixing it.

> 
> In the specific case of NAND, how many NAND drivers use DMA at all?

Many do, it's not only nand, it's all over the place.

SPI, NAND, MMC etc.

> >> I'm not sure what bootm has to do with nand (and the fact that some
> >> ppc is cache coherent actually doesn't matter, since we don't do
> >> DMA for NAND), but I was able to bootm from an odd RAM address, and
> >> "nand read" to an odd RAM address, on p5020ds.
> > 
> > On ARM-land we have a lot of problems with unaligned addresses, even
> > with cache off.  I went to reproduce the original bootm problem and
> > ran into fatload hanging.  tftp didn't fail but bootm hangs.
> 
> Maybe you can't take alignment exceptions during bootm?  PPC doesn't
> normally take alignment checks, but we would have trouble with this
> scenario if it did, since bootm clobbers the exception vectors.
> 
> -Scott

Best regards,
Marek Vasut
Scott Wood June 25, 2012, 11:57 p.m. UTC | #10
On 06/25/2012 06:37 PM, Marek Vasut wrote:
> Dear Scott Wood,
> 
>> On 06/24/2012 07:17 PM, Marek Vasut wrote:
>>> This prevents the scenario where data cache is on and the
>>> device uses DMA to deploy data. In that case, it might not
>>> be possible to flush/invalidate data to RAM properly. The
>>> other option is to use bounce buffer,
>>
>> Or get cache coherent hardware. :-)
> 
> Oh ... you mean powerpc? Or rather something like this 
> http://cache.freescale.com/files/32bit/doc/fact_sheet/QORIQLS2FAMILYFS.pdf ? :-D

The word "coherent/coherency" appears 5 times in that document by my
count. :-)

I hope that applies to DMA, not just core-to-core.

>>> but that involves a lot of copying and therefore degrades performance
>>> rapidly. Therefore disallow this possibility of unaligned load
>>> address altogether if data cache is on.
>>
>> How about use the bounce buffer only if the address is misaligned?
> 
> Not happening, bounce buffer is bullshit,

Hacking up the common frontend with a new limitation because you can't
be bothered to fix your drivers is bullshit.

> It's like driving a car in the wrong lane. Sure, you can do it, but it'll 
> eventually have some consequences. And using a bounce buffer is like driving a 
> tank in the wrong lane ...

Using a bounce buffer is like parking your car before going into the
building, rather than insisting the building's hallways be paved.

>> The
>> corrective action a user has to take is the same as with this patch,
>> except for an additional option of living with the slight performance
>> penalty.
> 
> Slight is very weak word here.

Prove me wrong with benchmarks.

>> How often does this actually happen?  How much does it
>> actually slow things down compared to the speed of the NAND chip?
> 
> If the user is dumb, always. But if you tell the user how to milk the most of 
> the hardware, he'll be happier.

So, if you use bounce buffers conditionally (based on whether the
address is misaligned), there's no impact except to "dumb" users, and
for those users they would merely get a performance degradation rather
than breakage.  How is this "bullshit"?

>> I'm hesitant to break something -- even if it's odd (literally in this
>> case) -- that currently works on most hardware, just because one or two
>> drivers can't handle it.  It feels kind of like changing the read() and
>> write() system calls to require cacheline alignment. :-P
> 
> That's actually almost right, we're doing a bootloader here, it might have 
> limitations. We're not writing yet another operating system with no bounds on 
> possibilities!

We also don't need to bend over backwards to squeeze every last cycle
out of the boot process, at the expense of a stable user interface (not
to mention requiring the user to know the system's cache line size).

-Scott
Scott Wood June 26, 2012, 12:37 a.m. UTC | #11
On 06/25/2012 06:42 PM, Marek Vasut wrote:
> Dear Scott Wood,
> 
>> On 06/25/2012 03:48 PM, Tom Rini wrote:
>>> Right.  What I'm trying to say is it's not a NAND problem it's an
>>> unaligned addresses problem so the solution needs to be easily used
>>> everywhere.
>>
>> OK, so fix it in each driver that has this issue.  A lot of drivers are
>> probably not so performance critical that you can't just always use a
>> bounce buffer.  A static buffer plus memcpy isn't that burdensome --
>> it's close to what the drivers for non-DMA hardware do.  For higher
>> performance peripherals, throw in an if-statement or two.  It doesn't
>> seem like something that needs a U-Boot-wide change.
> 
> This is flat bull. I don't want bounce buffers growing all around uboot, see my 
> previous email. I'm 120% firm in that.

I'm 150% firm that you're going to need a better justification to change
the user interface for everybody.

What's next, restricting "cp" (or even "cp.b") in case someone wants to
use a DMA engine to copy the bits?

Note that in the case of "nand read.oob", depending on NAND page size
and platform, there's a good chance that you're imposing an alignment
restriction that is larger than the data being transferred even if the
user asks to read the entire OOB.

What about "nand write.yaffs2" or multi-page "nand read.raw", which deal
with arrays of interleaved main+spare?  With a small page NAND chip,
you'll need cache lines that are 16 bytes or smaller to avoid unaligned
transactions -- and those will bypass your front-end check (unless the
user is so "stupid" as to want to modify the data after a raw-read, and
do a raw-write of a particular page).

> And btw it's not about bounce buffers, it's also about other code (like FS code) 
> which does unaligned accesses and we're fixing it.

Fixing the alignment of U-Boot-generated buffers is good.  That doesn't
affect the user interface.  This is different.

>> In the specific case of NAND, how many NAND drivers use DMA at all?
> 
> Many do, 

How many?  Specifically, how many that have alignment restrictions, that
would need to be fixed?

> it's not only nand, it's all over the place.

This patch is about NAND.

-Scott
Marek Vasut June 26, 2012, 1:16 a.m. UTC | #12
Dear Scott Wood,

> On 06/25/2012 06:42 PM, Marek Vasut wrote:
> > Dear Scott Wood,
> > 
> >> On 06/25/2012 03:48 PM, Tom Rini wrote:
> >>> Right.  What I'm trying to say is it's not a NAND problem it's an
> >>> unaligned addresses problem so the solution needs to be easily used
> >>> everywhere.
> >> 
> >> OK, so fix it in each driver that has this issue.  A lot of drivers are
> >> probably not so performance critical that you can't just always use a
> >> bounce buffer.  A static buffer plus memcpy isn't that burdensome --
> >> it's close to what the drivers for non-DMA hardware do.  For higher
> >> performance peripherals, throw in an if-statement or two.  It doesn't
> >> seem like something that needs a U-Boot-wide change.
> > 
> > This is flat bull. I don't want bounce buffers growing all around uboot,
> > see my previous email. I'm 120% firm in that.
> 
> I'm 150% firm that you're going to need a better justification to change
> the user interface for everybody.

I'm trying to make it user-proof. Honestly, loading to some bullshit unaligned 
address is a corner case, so it won't affect many people. And that minority 
who'll be affected will adjust their stuff by a few bytes, no problem there.

> What's next, restricting "cp" (or even "cp.b") in case someone wants to
> use a DMA engine to copy the bits?

Ad absurdum indeed, but it's a partly valid point. You'd need to restrict it 
somehow if that'd be the case, therefore I'll add it to the DM. Drivers should 
be able to specify their requirements.

> Note that in the case of "nand read.oob", depending on NAND page size
> and platform, there's a good chance that you're imposing an alignment
> restriction that is larger than the data being transferred even if the
> user asks to read the entire OOB.

I don't think I completely follow here.

> What about "nand write.yaffs2" or multi-page "nand read.raw", which deal
> with arrays of interleaved main+spare?  With a small page NAND chip,
> you'll need cache lines that are 16 bytes or smaller to avoid unaligned
> transactions -- and those will bypass your front-end check (unless the
> user is so "stupid" as to want to modify the data after a raw-read, and
> do a raw-write of a particular page).

Ok, I think I'm very stupid now, probably since I have high fever. I'll read 
this after I sleep. Sorry Scott, I'm sure you're rolling out a valid point, it's 
just that I'm incapable of understanding it right now.

> > And btw it's not about bounce buffers, it's also about other code (like
> > FS code) which does unaligned accesses and we're fixing it.
> 
> Fixing the alignment of U-Boot-generated buffers is good.  That doesn't
> affect the user interface.  This is different.

And it's the easy part.

> >> In the specific case of NAND, how many NAND drivers use DMA at all?
> > 
> > Many do,
> 
> How many?  Specifically, how many that have alignment restrictions, that
> would need to be fixed?

At least all on the ARM architecture. And more will come, since ARM is on the 
rise.

> > it's not only nand, it's all over the place.
> 
> This patch is about NAND.

Check the whole patchset ... and that still only covers a small part of it all.

> -Scott

Best regards,
Marek Vasut
Marek Vasut June 26, 2012, 1:33 a.m. UTC | #13
Dear Scott Wood,

> On 06/25/2012 06:37 PM, Marek Vasut wrote:
> > Dear Scott Wood,
> > 
> >> On 06/24/2012 07:17 PM, Marek Vasut wrote:
> >>> This prevents the scenario where data cache is on and the
> >>> device uses DMA to deploy data. In that case, it might not
> >>> be possible to flush/invalidate data to RAM properly. The
> >>> other option is to use bounce buffer,
> >> 
> >> Or get cache coherent hardware. :-)
> > 
> > Oh ... you mean powerpc? Or rather something like this
> > http://cache.freescale.com/files/32bit/doc/fact_sheet/QORIQLS2FAMILYFS.pdf ? :-D
> 
> The word "coherent/coherency" appears 5 times in that document by my
> count. :-)
> 
> I hope that applies to DMA, not just core-to-core.
> 
> >>> but that involves a lot of copying and therefore degrades performance
> >>> rapidly. Therefore disallow this possibility of unaligned load
> >>> address altogether if data cache is on.
> >> 
> >> How about use the bounce buffer only if the address is misaligned?
> > 
> > Not happening, bounce buffer is bullshit,
> 
> Hacking up the common frontend with a new limitation because you can't
> be bothered to fix your drivers is bullshit.

The drivers are not broken, they have hardware limitations. And checking for 
those has to be done as early as possible. And it's not a new common frontend!

> > It's like driving a car in the wrong lane. Sure, you can do it, but it'll
> > eventually have some consequences. And using a bounce buffer is like
> > driving a tank in the wrong lane ...
> 
> Using a bounce buffer is like parking your car before going into the
> building, rather than insisting the building's hallways be paved.

The other is obviously faster, more comfortable and lets you carry more stuff at 
once. And if you drive a truck, you can dump a lot of payload instead of 
carrying it back and forth from the building. That's why there's a special 
garage for trucks possibly with cargo elevators etc.

> >> The
> >> corrective action a user has to take is the same as with this patch,
> >> except for an additional option of living with the slight performance
> >> penalty.
> > 
> > Slight is very weak word here.
> 
> Prove me wrong with benchmarks.

Well, copying data back and forth is tremendous overhead. You don't need a 
benchmark to calculate something like this:

133MHz SDRAM (pumped) gives you what ... 133 Mb/s throughput
(now if it's DDR, dual/quad pumped, that doesn't give you any more advantage 
since you have to: send address, read the data, send address, write the data ... 
this is expensive ... without data cache on, even more so)

Now consider you do it via really dumb memcpy, what happens:
1) You need to read the data into register
1a) Send address
1b) Read the data into register
2) You need to write the data to a new location
2a) Send address
2b) Write the data into the memory

In the meantime, you get some refresh cycles etc. Now, if you take read and 
write in 1 time unit and "send address" in 0.5 time unit (this gives total 3 
time units per one loop) and consider you're not doing sustained read/write, you 
should see you'll be able to copy at speed of about 133/3 ~= 40Mb/s

If you want to load 3MB kernel at 40Mb/s onto an unaligned address via DMA, the 
DMA will deploy it via sustained write, that'll be at 10MB/s, therefore in 
300ms. But the subsequent copy will take another 600ms.

And now, I need someone to recalculate it. Also, read up here, it is _VERY_ good 
and it is certainly more accurate than my previous delirious attempt:

http://www.akkadia.org/drepper/cpumemory.pdf

> >> How often does this actually happen?  How much does it
> >> actually slow things down compared to the speed of the NAND chip?
> > 
> > If the user is dumb, always. But if you tell the user how to milk the
> > most of the hardware, he'll be happier.
> 
> So, if you use bounce buffers conditionally (based on whether the
> address is misaligned), there's no impact except to "dumb" users, and
> for those users they would merely get a performance degradation rather
> than breakage.  How is this "bullshit"?

Correct, but users will complain if they get a subpar performance.

> >> I'm hesitant to break something -- even if it's odd (literally in this
> >> case) -- that currently works on most hardware, just because one or two
> >> drivers can't handle it.  It feels kind of like changing the read() and
> >> write() system calls to require cacheline alignment. :-P
> > 
> > That's actually almost right, we're doing a bootloader here, it might
> > have limitations. We're not writing yet another operating system with no
> > bounds on possibilities!
> 
> We also don't need to bend over backwards to squeeze every last cycle
> out of the boot process, at the expense of a stable user interface (not
> to mention requiring the user to know the system's cache line size).

But that's reported in my patch ;-)

And yes, we want to make the boot process as blazing fast as possible. Imagine 
you fire a rocket into deep space and it breaks and needs a reboot; will you 
enjoy the waiting? ;-)

> -Scott

Best regards,
Marek Vasut
Scott Wood June 26, 2012, 7:25 p.m. UTC | #14
On 06/25/2012 08:33 PM, Marek Vasut wrote:
> Dear Scott Wood,
> 
>> On 06/25/2012 06:37 PM, Marek Vasut wrote:
>>> Dear Scott Wood,
>>>
>>>> On 06/24/2012 07:17 PM, Marek Vasut wrote:
>>>>> but that involves a lot of copying and therefore degrades performance
>>>>> rapidly. Therefore disallow this possibility of unaligned load
>>>>> address altogether if data cache is on.
>>>>
>>>> How about use the bounce buffer only if the address is misaligned?
>>>
>>> Not happening, bounce buffer is bullshit,
>>
>> Hacking up the common frontend with a new limitation because you can't
>> be bothered to fix your drivers is bullshit.
> 
> The drivers are not broken, they have hardware limitations. 

They're broken because they ignore those limitations.

> And checking for 
> those has to be done as early as possible.

Why?

> And it's not a new common frontend!

No, it's a compatibility-breaking change to the existing common frontend.

>>> It's like driving a car in the wrong lane. Sure, you can do it, but it'll
>>> eventually have some consequences. And using a bounce buffer is like
>>> driving a tank in the wrong lane ...
>>
>> Using a bounce buffer is like parking your car before going into the
>> building, rather than insisting the building's hallways be paved.
> 
> The other is obviously faster, more comfortable and lets you carry more stuff at 
> once.

Then you end up needing buildings to be many times as large to give
every cubicle an adjacent parking spot, maneuvering room, etc.  You'll
be breathing fumes all day, and it'll be a lot less comfortable to get
even across the hallway without using a car, etc.  Communication between
coworkers would be limited to horns and obscene gestures. :-)

> And if you drive a truck, you can dump a lot of payload instead of 
> carrying it back and forth from the building. That's why there's a special 
> garage for trucks possibly with cargo elevators etc.

Yes, it's called targeted optimization rather than premature optimization.

>>>> The
>>>> corrective action a user has to take is the same as with this patch,
>>>> except for an additional option of living with the slight performance
>>>> penalty.
>>>
>>> Slight is very weak word here.
>>
>> Prove me wrong with benchmarks.
> 
> Well, copying data back and forth is tremendous overhead. You don't need a 
> benchmark to calculate something like this:
> 
> 133MHz SDRAM (pumped) gives you what ... 133 Mb/s throughput

You're saying you get only a little more bandwidth from memory than
you'd get from a 100 Mb/s ethernet port?  Come on.  Data buses are not
one bit wide.

And how fast can you pull data out of a NAND chip, even with DMA?

> (now if it's DDR, dual/quad pumped, that doesn't give you any more advantage 

So such things were implemented for fun?

> since you have to: send address, read the data, send address, write the data ... 

What about bursts?  I'm pretty sure you don't have to send the address
separately for every single byte.

> this is expensive ... without data cache on, even more so)

Why do we care about "without data cache"?  You don't need the bounce
buffer in that case.

> Now consider you do it via really dumb memcpy, what happens:

It looks like ARM U-Boot has an optimized memcpy.

> 1) You need to read the data into register
> 1a) Send address
> 1b) Read the data into register
> 2) You need to write the data to a new location
> 2a) Send address
> 2b) Write the data into the memory
> 
> In the meantime, you get some refresh cycles etc. Now, if you take read and 
> write in 1 time unit and "send address" in 0.5 time unit (this gives total 3 
> time units per one loop) and consider you're not doing sustained read/write, you 
> should see you'll be able to copy at speed of about 133/3 ~= 40Mb/s
> 
> If you want to load 3MB kernel at 40Mb/s onto an unaligned address via DMA, the 
> DMA will deploy it via sustained write, that'll be at 10MB/s, therefore in 
> 300ms. But the subsequent copy will take another 600ms.

On a p5020ds, using NAND hardware that doesn't do DMA at all, I'm able
to load a 3MiB image from NAND in around 300-400 ms.  This is with using
memcpy_fromio() on an uncached hardware buffer.

Again, I'm not saying that bounce buffers are always negligible overhead
-- just that I doubt NAND is fast enough that it makes a huge difference
in this specific case.

>>>> How often does this actually happen?  How much does it
>>>> actually slow things down compared to the speed of the NAND chip?
>>>
>>> If the user is dumb, always. But if you tell the user how to milk the
>>> most of the hardware, he'll be happier.
>>
>> So, if you use bounce buffers conditionally (based on whether the
>> address is misaligned), there's no impact except to "dumb" users, and
>> for those users they would merely get a performance degradation rather
>> than breakage.  How is this "bullshit"?
> 
> Correct, but users will complain if they get a subpar performance.

If you expend the minimal effort required to make the bounce buffer
usage conditional on the address actually being misaligned, the only
users that will see subpar performance are those who would see breakage
with your approach.  Users will complain if they see breakage even more
than when they see subpar performance.

If the issue is educating the user to avoid the performance hit
(regardless of magnitude), and you care enough, have the driver print a
warning (not error) message the first time it needs to use a bounce buffer.

-Scott
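
A minimal sketch of the conditional fallback Scott describes: DMA straight into
the caller's buffer when it is aligned, otherwise DMA into an aligned static
buffer and memcpy out, warning once. All names here (struct my_nand,
chip->dma_read, the 2 KiB bounce size) are hypothetical and not taken from any
actual U-Boot driver:

    #include <common.h>		/* printf(), memcpy(), u8 */
    #include <asm/cache.h>	/* ARCH_DMA_MINALIGN */

    struct my_nand {
    	/* driver's low-level DMA transfer into a suitably aligned buffer */
    	int (*dma_read)(struct my_nand *chip, void *buf, size_t len);
    	/* static, cacheline-aligned bounce buffer (one page here) */
    	u8 bounce[2048] __aligned(ARCH_DMA_MINALIGN);
    };

    static int nand_dma_read(struct my_nand *chip, void *dst, size_t len)
    {
    	static int warned;
    	u8 *p = dst;

    	/* Fast path: caller's buffer is already cacheline-aligned. */
    	if (!((ulong)p & (ARCH_DMA_MINALIGN - 1)))
    		return chip->dma_read(chip, p, len);

    	/* Slow path: bounce through the aligned static buffer. */
    	if (!warned) {
    		printf("WARNING: unaligned buffer %p, using bounce buffer\n", p);
    		warned = 1;
    	}

    	while (len) {
    		size_t chunk = len < sizeof(chip->bounce) ?
    				len : sizeof(chip->bounce);
    		int ret = chip->dma_read(chip, chip->bounce, chunk);

    		if (ret)
    			return ret;
    		memcpy(p, chip->bounce, chunk);
    		p += chunk;
    		len -= chunk;
    	}

    	return 0;
    }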
Scott Wood June 26, 2012, 7:38 p.m. UTC | #15
On 06/25/2012 08:16 PM, Marek Vasut wrote:
> Dear Scott Wood,
>> Note that in the case of "nand read.oob", depending on NAND page size
>> and platform, there's a good chance that you're imposing an alignment
>> restriction that is larger than the data being transferred even if the
>> user asks to read the entire OOB.
> 
> I don't think I completely follow here.

Why should I need to have 64-byte alignment for transferring a 16-byte OOB?

>> What about "nand write.yaffs2" or multi-page "nand read.raw", which deal
>> with arrays of interleaved main+spare?  With a small page NAND chip,
>> you'll need cache lines that are 16 bytes or smaller to avoid unaligned
>> transactions -- and those will bypass your front-end check (unless the
>> user is so "stupid" as to want to modify the data after a raw-read, and
>> do a raw-write of a particular page).
> 
> Ok, I think I'm very stupid now, probably since I have high fever. I'll read 
> this after I sleep. Sorry Scott, I'm sure you're rolling out a valid point, it's 
> just that I'm incapable of understanding it right now.

Probably best to wait until you're feeling better for the rest of it,
too.  Hope you get well soon.

>>>> In the specific case of NAND, how many NAND drivers use DMA at all?
>>>
>>> Many do,
>>
>> How many?  Specifically, how many that have alignment restrictions, that
>> would need to be fixed?
> 
> At least all on the ARM architecture. And more will come, since ARM is on the 
> rise.

atmel: no, doesn't use DMA
davinci: no, doesn't use DMA
kb9202: no, doesn't use DMA
kirkwood: no, doesn't use DMA
mxc: I don't think this uses DMA, but this driver scares me. :-)
mxs: Even scarier.  Looks like this one does use DMA.
omap_gpmc: no, doesn't use DMA
s3c64xx: no, doesn't use DMA
spr: no, doesn't use DMA
s3c2410: no, doesn't use DMA

If this is about not wanting to touch the mxs_nand driver again, I
sympathize. :-)

>>> it's not only nand, it's all over the place.
>>
>> This patch is about NAND.
> 
> Check the whole patchset ... and that still only covers a small part of it all.

Right, I think it's wrong elsewhere too when it's user interface, but
I'll let the relevant custodians argue those cases.

-Scott
Marek Vasut June 26, 2012, 8:39 p.m. UTC | #16
Dear Scott Wood,

> On 06/25/2012 08:33 PM, Marek Vasut wrote:
> > Dear Scott Wood,
> > 
> >> On 06/25/2012 06:37 PM, Marek Vasut wrote:
> >>> Dear Scott Wood,
> >>> 
> >>>> On 06/24/2012 07:17 PM, Marek Vasut wrote:
> >>>>> but that involves a lot of copying and therefore degrades performance
> >>>>> rapidly. Therefore disallow this possibility of unaligned load
> >>>>> address altogether if data cache is on.
> >>>> 
> >>>> How about use the bounce buffer only if the address is misaligned?
> >>> 
> >>> Not happening, bounce buffer is bullshit,
> >> 
> >> Hacking up the common frontend with a new limitation because you can't
> >> be bothered to fix your drivers is bullshit.
> > 
> > The drivers are not broken, they have hardware limitations.
> 
> They're broken because they ignore those limitations.
> 
> > And checking for
> > those has to be done as early as possible.
> 
> Why?

Why keep the overhead?

> > And it's not a new common frontend!
> 
> No, it's a compatibility-breaking change to the existing common frontend.

Well, those are corner cases; it's not like people will start hitting it en masse.

I agree it should somehow be platform- or even CPU-specific.

> >>> It's like driving a car in the wrong lane. Sure, you can do it, but
> >>> it'll eventually have some consequences. And using a bounce buffer is
> >>> like driving a tank in the wrong lane ...
> >> 
> >> Using a bounce buffer is like parking your car before going into the
> >> building, rather than insisting the building's hallways be paved.
> > 
> > The other is obviously faster, more comfortable and lets you carry more
> > stuff at once.
> 
> Then you end up needing buildings to be many times as large to give
> every cubicle an adjacent parking spot, maneuvering room, etc.  You'll
> be breathing fumes all day, and it'll be a lot less comfortable to get
> even across the hallway without using a car, etc.  Communication between
> coworkers would be limited to horns and obscene gestures. :-)

Ok, this has gone waaay out of hand here :-)

> > And if you drive a truck, you can dump a lot of payload instead of
> > carrying it back and forth from the building. That's why there's a
> > special garage for trucks possibly with cargo elevators etc.
> 
> Yes, it's called targeted optimization rather than premature optimization.
> 
> >>>> The
> >>>> corrective action a user has to take is the same as with this patch,
> >>>> except for an additional option of living with the slight performance
> >>>> penalty.
> >>> 
> >>> Slight is very weak word here.
> >> 
> >> Prove me wrong with benchmarks.
> > 
> > Well, copying data back and forth is tremendous overhead. You don't need
> > a benchmark to calculate something like this:
> > 
> > 133MHz SDRAM (pumped) gives you what ... 133 Mb/s throughput
> 
> You're saying you get only a little more bandwidth from memory than
> you'd get from a 100 Mb/s ethernet port?  Come on.  Data buses are not
> one bit wide.

Good point, it was too late and I forgot to count that in.

> And how fast can you pull data out of a NAND chip, even with DMA?
> 
> > (now if it's DDR, dual/quad pumped, that doesn't give you any more
> > advantage
> 
> So such things were implemented for fun?
> 
> > since you have to: send address, read the data, send address, write the
> > data ...
> 
> What about bursts?  I'm pretty sure you don't have to send the address
> separately for every single byte.

If you do memcpy? You only have registers; sure, you can optimize it, but on 
Intel, for example, you don't have many of those.

> > this is expensive ... without data cache on, even more so)
> 
> Why do we care about "without data cache"?  You don't need the bounce
> buffer in that case.

Correct, it's expensive in both cases though.

> > Now consider you do it via really dump memcpy, what happens:
> It looks like ARM U-Boot has an optimized memcpy.
> 
> > 1) You need to read the data into register
> > 1a) Send address
> > 1b) Read the data into register
> > 2) You need to write the data to a new location
> > 2a) Send address
> > 2b) Write the data into the memory
> > 
> > In the meantime, you get some refresh cycles etc. Now, if you take read
> > and write in 1 time unit and "send address" in 0.5 time unit (this gives
> > total 3 time units per one loop) and consider you're not doing sustained
> > read/write, you should see you'll be able to copy at speed of about
> > 133/3 ~= 40Mb/s
> > 
> > If you want to load 3MB kernel at 40Mb/s onto an unaligned address via
> > DMA, the DMA will deploy it via sustained write, that'll be at 10MB/s,
> > therefore in 300ms. But the subsequent copy will take another 600ms.
> 
> On a p5020ds, using NAND hardware that doesn't do DMA at all, I'm able
> to load a 3MiB image from NAND in around 300-400 ms.  This is with using
> memcpy_fromio() on an uncached hardware buffer.

Blazing ... almost half a second. But everyone these days wants a faster boot; 
without memcpy, it can go down to 100 ms or even less! And this kind of 
limitation is not something that'd inconvenience anyone.

> Again, I'm not saying that bounce buffers are always negligible overhead
> -- just that I doubt NAND is fast enough that it makes a huge difference
> in this specific case.

It does make a difference. Correct, thinking about it -- implementing a generic 
bounce buffer for cache-impotent hardware might be a better way to go.

> >>>> How often does this actually happen?  How much does it
> >>>> actually slow things down compared to the speed of the NAND chip?
> >>> 
> >>> If the user is dumb, always. But if you tell the user how to milk the
> >>> most of the hardware, he'll be happier.
> >> 
> >> So, if you use bounce buffers conditionally (based on whether the
> >> address is misaligned), there's no impact except to "dumb" users, and
> >> for those users they would merely get a performance degradation rather
> >> than breakage.  How is this "bullshit"?
> > 
> > Correct, but users will complain if they get a subpar performance.
> 
> If you expend the minimal effort required to make the bounce buffer
> usage conditional on the address actually being misaligned, the only
> users that will see subpar performance are those who would see breakage
> with your approach.  Users will complain if they see breakage even more
> than when they see subpar performance.
> 
> If the issue is educating the user to avoid the performance hit
> (regardless of magnitude), and you care enough, have the driver print a
> warning (not error) message the first time it needs to use a bounce buffer.

Ok, on second thought, you're right. Let's do it the same way we did it with 
mmc.

> -Scott

Best regards,
Marek Vasut
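
The MMC-style approach agreed on here corresponds to the generic helper found
in mainline U-Boot as common/bouncebuf.c. A usage sketch, under the assumption
of that API (struct bounce_buffer with bounce_buffer_start()/bounce_buffer_stop()
and the GEN_BB_WRITE flag, recalled from mainline rather than taken from this
series; do_dma_transfer() is a hypothetical stand-in for the driver's DMA call):

    #include <common.h>
    #include <bouncebuf.h>

    int do_dma_transfer(void *buf, size_t len);	/* driver-specific DMA */

    static int nand_dma_read_bb(void *user_buf, size_t len)
    {
    	struct bounce_buffer bb;
    	int ret;

    	/*
    	 * GEN_BB_WRITE: the device writes into the buffer, so the data is
    	 * copied back to user_buf in bounce_buffer_stop().  If user_buf is
    	 * already aligned, no separate buffer is allocated and no copy occurs.
    	 */
    	ret = bounce_buffer_start(&bb, user_buf, len, GEN_BB_WRITE);
    	if (ret)
    		return ret;

    	ret = do_dma_transfer(bb.bounce_buffer, bb.len_aligned);

    	bounce_buffer_stop(&bb);
    	return ret;
    }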
Aneesh V July 7, 2012, 3:05 a.m. UTC | #17
On 06/24/2012 05:17 PM, Marek Vasut wrote:
> This prevents the scenario where data cache is on and the
> device uses DMA to deploy data. In that case, it might not
> be possible to flush/invalidate data to RAM properly. The
> other option is to use bounce buffer, but that involves a
> lot of copying and therefore degrades performance rapidly.
> Therefore disallow this possibility of unaligned load address
> altogether if data cache is on.
>
> Signed-off-by: Marek Vasut <marex@denx.de>
> Cc: Scott Wood <scottwood@freescale.com>
> ---
>   common/cmd_nand.c |    9 +++++++++
>   1 file changed, 9 insertions(+)
>
> diff --git a/common/cmd_nand.c b/common/cmd_nand.c
> index a91ccf4..122a91c 100644
> --- a/common/cmd_nand.c
> +++ b/common/cmd_nand.c
> @@ -609,6 +609,8 @@ int do_nand(cmd_tbl_t * cmdtp, int flag, int argc, char * const argv[])
>   			goto usage;
>
>   		addr = (ulong)simple_strtoul(argv[2], NULL, 16);
> +		if (!cacheline_aligned(addr))
> +			return 1;

You need to check the end address too. Also, I agree with Scott
that this is an unjustifiable restriction on cache-coherent
architectures. IMO, such checks should be done in platform-specific
code where the DMA is being attempted.

best regards,
Aneesh
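
A sketch of the extended check Aneesh asks for, covering the transfer length
(and hence the end address) as well as the start address. The helper name and
its placement are assumed for illustration and are not part of the posted series:

    #include <common.h>		/* printf(), dcache_status() */
    #include <asm/cache.h>	/* ARCH_DMA_MINALIGN */

    /*
     * Return 1 if both the start address and the length are suitable for a
     * DMA transfer with caches enabled, 0 (after printing why) otherwise.
     */
    static int dma_region_aligned(ulong addr, size_t len)
    {
    	if (!dcache_status())
    		return 1;

    	if ((addr | len) & (ARCH_DMA_MINALIGN - 1)) {
    		printf("ERROR: buffer 0x%08lx..0x%08lx is not %d-byte "
    		       "cacheline aligned\n", addr, addr + len - 1,
    		       ARCH_DMA_MINALIGN);
    		return 0;
    	}

    	return 1;
    }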

Patch

diff --git a/common/cmd_nand.c b/common/cmd_nand.c
index a91ccf4..122a91c 100644
--- a/common/cmd_nand.c
+++ b/common/cmd_nand.c
@@ -609,6 +609,8 @@  int do_nand(cmd_tbl_t * cmdtp, int flag, int argc, char * const argv[])
 			goto usage;
 
 		addr = (ulong)simple_strtoul(argv[2], NULL, 16);
+		if (!cacheline_aligned(addr))
+			return 1;
 
 		read = strncmp(cmd, "read", 4) == 0; /* 1 = read, 0 = write */
 		printf("\nNAND %s: ", read ? "read" : "write");
@@ -924,6 +926,9 @@  int do_nandboot(cmd_tbl_t * cmdtp, int flag, int argc, char * const argv[])
 				addr = simple_strtoul(argv[1], NULL, 16);
 			else
 				addr = CONFIG_SYS_LOAD_ADDR;
+			if (!cacheline_aligned(addr))
+				return 1;
+
 			return nand_load_image(cmdtp, &nand_info[dev->id->num],
 					       part->offset, addr, argv[0]);
 		}
@@ -956,6 +961,10 @@  usage:
 		bootstage_error(BOOTSTAGE_ID_NAND_SUFFIX);
 		return CMD_RET_USAGE;
 	}
+
+	if (!cacheline_aligned(addr))
+		return 1;
+
 	bootstage_mark(BOOTSTAGE_ID_NAND_SUFFIX);
 
 	if (!boot_device) {