Message ID | 1422449743-10119-3-git-send-email-p.marczak@samsung.com |
---|---|
State | Superseded |
Delegated to: | Łukasz Majewski |
Headers | show |
Hello Przemyslaw, On Wed, 28 Jan 2015 13:55:42 +0100, Przemyslaw Marczak <p.marczak@samsung.com> wrote: > For ARM architecture, enable the CONFIG_USE_ARCH_MEMSET/MEMCPY, > will highly increase the memset/memcpy performance. This is able > thanks to the ARM multiple register instructions. > > Unfortunatelly the relocation is done without the cache enabled, > so it takes some time, but zeroing the BSS memory takes much more > longer, especially for the configs with big static buffers. > > A quick test confirms, that the boot time improvement after using > the arch memcpy for relocation has no significant meaning. > The same test confirms that enable the memset for zeroing BSS, > reduces the boot time. > > So this patch enables the arch memset for zeroing the BSS after > the relocation process. For ARM boards, this can be enabled > in board configs by defining: 'CONFIG_USE_ARCH_MEMSET'. Since the issue is that zeroing is done one word at a time, could we not simply clear r3 as well as r2 (possibly even r4 and r5 too) and do a double (possibly quadruple) write loop? That would avoid calling a libc routine from the almost sole file in U-Boot where a C environment is not necessarily granted. Amicalement,
On 2 Feb 2015, bpringlemeir@nbsps.com wrote: > On 31 Jan 2015, albert.u.boot@aribaud.net wrote: > >> Hello Przemyslaw, >> >> On Wed, 28 Jan 2015 13:55:42 +0100, Przemyslaw Marczak >> <p.marczak@samsung.com> wrote: >>> For ARM architecture, enable the CONFIG_USE_ARCH_MEMSET/MEMCPY, >>> will highly increase the memset/memcpy performance. This is able >>> thanks to the ARM multiple register instructions. >>> >>> Unfortunatelly the relocation is done without the cache enabled, >>> so it takes some time, but zeroing the BSS memory takes much more >>> longer, especially for the configs with big static buffers. >>> >>> A quick test confirms, that the boot time improvement after using >>> the arch memcpy for relocation has no significant meaning. >>> The same test confirms that enable the memset for zeroing BSS, >>> reduces the boot time. >>> >>> So this patch enables the arch memset for zeroing the BSS after >>> the relocation process. For ARM boards, this can be enabled >>> in board configs by defining: 'CONFIG_USE_ARCH_MEMSET'. > >> Since the issue is that zeroing is done one word at a time, could we >> not simply clear r3 as well as r2 (possibly even r4 and r5 too) and >> do a double (possibly quadruple) write loop? That would avoid calling >> a libc routine from the almost sole file in U-Boot where a C >> environment is not necessarily granted. I thought the same thing... > diff --git a/arch/arm/lib/crt0.S b/arch/arm/lib/crt0.S index > 22df3e5..fab3d2c 100644 --- a/arch/arm/lib/crt0.S +++ > b/arch/arm/lib/crt0.S @@ -115,14 +115,22 @@ here: bl > c_runtime_cpu_setup /* we still call old routine here */ > > ldr r0, =__bss_start /* this is auto-relocated! */ > - ldr r1, =__bss_end /* this is auto-relocated! */ > > +#ifdef CONFIG_USE_ARCH_MEMSET > + ldr r3, =__bss_end /* this is auto-relocated! */ > + mov r1, #0x00000000 /* prepare zero to clear BSS */ > + > + subs r2, r3, r0 /* r2 = memset len */ > + bl memset > +#else > + ldr r1, =__bss_end /* this is auto-relocated! */ > mov r2, #0x00000000 /* prepare zero to clear BSS */ > > clbss_l:cmp r0, r1 /* while not at end of BSS */ > strlo r2, [r0] /* clear 32-bit BSS word */ > addlo r0, r0, #4 /* move to next */ > blo clbss_l > +#endif This is great news (increase boot speed). Maybe if this files wasn't conditional? Assuming the the 'BSS' is aligned (LDS enforced), ldr r0, =__bss_start /* this is auto-relocated! */ ldr r1, =__bss_end /* this is auto-relocated! */ + mov r2, #0 /* prepare zero to clear BSS */ + mov r3, #0 + mov r4, #0 + mov r5, #0 + mov r6, #0 + mov r7, #0 + mov r8, #0 + mov lr, #0 + b 1f + .align 4 clbss_l: + stmia r0!, {r2-r8,lr} /* clear 32 BSS word */ + stmia r0!, {r2-r8,lr} /* easy to unroll */ + 1: cmp r0, r1 blo clbss_l The code should work on all ARM versions? Then every platform would benefit. I think the larger portion of the 'ARCH memset' is to handle alignment issues. Sometimes the tail/head portion can be handled efficiently with 'strd', etc which is only on some CPUs. It is easy to setup the BSS so that both the head/tail are aligned.... but I think the above code only requires multiples of 32bytes as total BSS size (it is easy to jump into the middle of an unrolled loop with 'add pc, rn<<2, #constant'). The size/iteration can be easily tweaked (8/16/32 bytes). At least in principal *if* there is some size alignment on BSS it is fairly easy to write some generic ARM to quickly clear the BSS that will be just as competitive as any ARCH memset version. The above code is adding about 13 words of code. Fwiw, Bill Pringlemeir.
On Sun, Feb 01, 2015 at 03:38:42AM +0100, Albert ARIBAUD wrote: > Hello Przemyslaw, > > On Wed, 28 Jan 2015 13:55:42 +0100, Przemyslaw Marczak > <p.marczak@samsung.com> wrote: > > For ARM architecture, enable the CONFIG_USE_ARCH_MEMSET/MEMCPY, > > will highly increase the memset/memcpy performance. This is able > > thanks to the ARM multiple register instructions. > > > > Unfortunatelly the relocation is done without the cache enabled, > > so it takes some time, but zeroing the BSS memory takes much more > > longer, especially for the configs with big static buffers. > > > > A quick test confirms, that the boot time improvement after using > > the arch memcpy for relocation has no significant meaning. > > The same test confirms that enable the memset for zeroing BSS, > > reduces the boot time. > > > > So this patch enables the arch memset for zeroing the BSS after > > the relocation process. For ARM boards, this can be enabled > > in board configs by defining: 'CONFIG_USE_ARCH_MEMSET'. > > Since the issue is that zeroing is done one word at a time, could we > not simply clear r3 as well as r2 (possibly even r4 and r5 too) and do > a double (possibly quadruple) write loop? That would avoid calling a > libc routine from the almost sole file in U-Boot where a C environment > is not necessarily granted. So this brings up something I've wondered about for a long while. We have arch/arm/lib/mem{set,cpy}.S which are old copies from the linux kernel. The kernel uses them for all ARM platforms. Why do we not always use these functions? I have a very vague notion it was a size thing...
Hi Tom, > On Feb 2, 2015, at 19:25 , Tom Rini <trini@ti.com> wrote: > > On Sun, Feb 01, 2015 at 03:38:42AM +0100, Albert ARIBAUD wrote: >> Hello Przemyslaw, >> >> On Wed, 28 Jan 2015 13:55:42 +0100, Przemyslaw Marczak >> <p.marczak@samsung.com> wrote: >>> For ARM architecture, enable the CONFIG_USE_ARCH_MEMSET/MEMCPY, >>> will highly increase the memset/memcpy performance. This is able >>> thanks to the ARM multiple register instructions. >>> >>> Unfortunatelly the relocation is done without the cache enabled, >>> so it takes some time, but zeroing the BSS memory takes much more >>> longer, especially for the configs with big static buffers. >>> >>> A quick test confirms, that the boot time improvement after using >>> the arch memcpy for relocation has no significant meaning. >>> The same test confirms that enable the memset for zeroing BSS, >>> reduces the boot time. >>> >>> So this patch enables the arch memset for zeroing the BSS after >>> the relocation process. For ARM boards, this can be enabled >>> in board configs by defining: 'CONFIG_USE_ARCH_MEMSET'. >> >> Since the issue is that zeroing is done one word at a time, could we >> not simply clear r3 as well as r2 (possibly even r4 and r5 too) and do >> a double (possibly quadruple) write loop? That would avoid calling a >> libc routine from the almost sole file in U-Boot where a C environment >> is not necessarily granted. > > So this brings up something I've wondered about for a long while. We > have arch/arm/lib/mem{set,cpy}.S which are old copies from the linux > kernel. The kernel uses them for all ARM platforms. Why do we not > always use these functions? I have a very vague notion it was a size > thing… > That is a good question. Are we being hobbled cause of MLO? If so we can use the short (and slow) methods in that case and use the fast methods in the normal case. It seems that this is warranted in this case. However in the particular case of dfu I think it’s best to avoid the large static buffers. Or if we do use the large buffers let’s put them in a linker segment that does not get zeroed on start. > -- > Tom Regards — Pantelis
On Mon, Feb 02, 2015 at 07:28:14PM +0200, Pantelis Antoniou wrote: > Hi Tom, > > > On Feb 2, 2015, at 19:25 , Tom Rini <trini@ti.com> wrote: > > > > On Sun, Feb 01, 2015 at 03:38:42AM +0100, Albert ARIBAUD wrote: > >> Hello Przemyslaw, > >> > >> On Wed, 28 Jan 2015 13:55:42 +0100, Przemyslaw Marczak > >> <p.marczak@samsung.com> wrote: > >>> For ARM architecture, enable the CONFIG_USE_ARCH_MEMSET/MEMCPY, > >>> will highly increase the memset/memcpy performance. This is able > >>> thanks to the ARM multiple register instructions. > >>> > >>> Unfortunatelly the relocation is done without the cache enabled, > >>> so it takes some time, but zeroing the BSS memory takes much more > >>> longer, especially for the configs with big static buffers. > >>> > >>> A quick test confirms, that the boot time improvement after using > >>> the arch memcpy for relocation has no significant meaning. > >>> The same test confirms that enable the memset for zeroing BSS, > >>> reduces the boot time. > >>> > >>> So this patch enables the arch memset for zeroing the BSS after > >>> the relocation process. For ARM boards, this can be enabled > >>> in board configs by defining: 'CONFIG_USE_ARCH_MEMSET'. > >> > >> Since the issue is that zeroing is done one word at a time, could we > >> not simply clear r3 as well as r2 (possibly even r4 and r5 too) and do > >> a double (possibly quadruple) write loop? That would avoid calling a > >> libc routine from the almost sole file in U-Boot where a C environment > >> is not necessarily granted. > > > > So this brings up something I've wondered about for a long while. We > > have arch/arm/lib/mem{set,cpy}.S which are old copies from the linux > > kernel. The kernel uses them for all ARM platforms. Why do we not > > always use these functions? I have a very vague notion it was a size > > thing… > > That is a good question. Are we being hobbled cause of MLO? If so we can > use the short (and slow) methods in that case and use the fast methods > in the normal case. It seems that this is warranted in this case. I'm not sure, but I can test easily enough. But even then we may want to opt a few targets in to the current (slow) path and make the default the optimized path. > However in the particular case of dfu I think it’s best to avoid the large > static buffers. Or if we do use the large buffers let’s put them in a > linker segment that does not get zeroed on start. Yes, I owe the rest of the series my attention too :)
On Sun, Feb 01, 2015 at 03:38:42AM +0100, Albert ARIBAUD wrote: > Hello Przemyslaw, > > On Wed, 28 Jan 2015 13:55:42 +0100, Przemyslaw Marczak > <p.marczak@samsung.com> wrote: > > For ARM architecture, enable the CONFIG_USE_ARCH_MEMSET/MEMCPY, > > will highly increase the memset/memcpy performance. This is able > > thanks to the ARM multiple register instructions. > > > > Unfortunatelly the relocation is done without the cache enabled, > > so it takes some time, but zeroing the BSS memory takes much more > > longer, especially for the configs with big static buffers. > > > > A quick test confirms, that the boot time improvement after using > > the arch memcpy for relocation has no significant meaning. > > The same test confirms that enable the memset for zeroing BSS, > > reduces the boot time. > > > > So this patch enables the arch memset for zeroing the BSS after > > the relocation process. For ARM boards, this can be enabled > > in board configs by defining: 'CONFIG_USE_ARCH_MEMSET'. > > Since the issue is that zeroing is done one word at a time, could we > not simply clear r3 as well as r2 (possibly even r4 and r5 too) and do > a double (possibly quadruple) write loop? That would avoid calling a > libc routine from the almost sole file in U-Boot where a C environment > is not necessarily granted. I want to jump up here again. Note that the arch memset/memcpy routines are in asm and I don't belive require a C environment. Why don't we simply use the asm versions for everyone and backport whatever we need from the kernel to re-sync there as it's not a choice there and it's a performance win too?
Hello Tom, On 02/12/2015 04:37 PM, Tom Rini wrote: > On Sun, Feb 01, 2015 at 03:38:42AM +0100, Albert ARIBAUD wrote: >> Hello Przemyslaw, >> >> On Wed, 28 Jan 2015 13:55:42 +0100, Przemyslaw Marczak >> <p.marczak@samsung.com> wrote: >>> For ARM architecture, enable the CONFIG_USE_ARCH_MEMSET/MEMCPY, >>> will highly increase the memset/memcpy performance. This is able >>> thanks to the ARM multiple register instructions. >>> >>> Unfortunatelly the relocation is done without the cache enabled, >>> so it takes some time, but zeroing the BSS memory takes much more >>> longer, especially for the configs with big static buffers. >>> >>> A quick test confirms, that the boot time improvement after using >>> the arch memcpy for relocation has no significant meaning. >>> The same test confirms that enable the memset for zeroing BSS, >>> reduces the boot time. >>> >>> So this patch enables the arch memset for zeroing the BSS after >>> the relocation process. For ARM boards, this can be enabled >>> in board configs by defining: 'CONFIG_USE_ARCH_MEMSET'. >> >> Since the issue is that zeroing is done one word at a time, could we >> not simply clear r3 as well as r2 (possibly even r4 and r5 too) and do >> a double (possibly quadruple) write loop? That would avoid calling a >> libc routine from the almost sole file in U-Boot where a C environment >> is not necessarily granted. > > I want to jump up here again. Note that the arch memset/memcpy routines > are in asm and I don't belive require a C environment. Why don't we > simply use the asm versions for everyone and backport whatever we need > from the kernel to re-sync there as it's not a choice there and it's a > performance win too? > Right, for ARM the mentioned routines doesn't require C env. But if we could achieve some improvement in this place, then maybe it has sense to add some new code just for bss. I will try to combine and make some timing tests on Monday. Best regards,
diff --git a/arch/arm/lib/crt0.S b/arch/arm/lib/crt0.S index 22df3e5..fab3d2c 100644 --- a/arch/arm/lib/crt0.S +++ b/arch/arm/lib/crt0.S @@ -115,14 +115,22 @@ here: bl c_runtime_cpu_setup /* we still call old routine here */ ldr r0, =__bss_start /* this is auto-relocated! */ - ldr r1, =__bss_end /* this is auto-relocated! */ +#ifdef CONFIG_USE_ARCH_MEMSET + ldr r3, =__bss_end /* this is auto-relocated! */ + mov r1, #0x00000000 /* prepare zero to clear BSS */ + + subs r2, r3, r0 /* r2 = memset len */ + bl memset +#else + ldr r1, =__bss_end /* this is auto-relocated! */ mov r2, #0x00000000 /* prepare zero to clear BSS */ clbss_l:cmp r0, r1 /* while not at end of BSS */ strlo r2, [r0] /* clear 32-bit BSS word */ addlo r0, r0, #4 /* move to next */ blo clbss_l +#endif bl coloured_LED_init bl red_led_on
For ARM architecture, enable the CONFIG_USE_ARCH_MEMSET/MEMCPY, will highly increase the memset/memcpy performance. This is able thanks to the ARM multiple register instructions. Unfortunatelly the relocation is done without the cache enabled, so it takes some time, but zeroing the BSS memory takes much more longer, especially for the configs with big static buffers. A quick test confirms, that the boot time improvement after using the arch memcpy for relocation has no significant meaning. The same test confirms that enable the memset for zeroing BSS, reduces the boot time. So this patch enables the arch memset for zeroing the BSS after the relocation process. For ARM boards, this can be enabled in board configs by defining: 'CONFIG_USE_ARCH_MEMSET'. This was tested on Trats2. A quick test with trace. Boot time from start to main_loop() entry: - ~1384ms - before this change - ~888ms - after this change Signed-off-by: Przemyslaw Marczak <p.marczak@samsung.com> Cc: Albert Aribaud <albert.u.boot@aribaud.net> Cc: Tom Rini <trini@ti.com> --- arch/arm/lib/crt0.S | 10 +++++++++- 1 file changed, 9 insertions(+), 1 deletion(-)