diff mbox

[U-Boot,v2,2/8] arm: relocation: clear .bss section with arch memset if defined

Message ID 1424099601-14979-3-git-send-email-p.marczak@samsung.com
State Changes Requested
Delegated to: Minkyu Kang
Headers show

Commit Message

Przemyslaw Marczak Feb. 16, 2015, 3:13 p.m. UTC
For ARM architecture, enable the CONFIG_USE_ARCH_MEMSET/MEMCPY,
will highly increase the memset/memcpy performance. This is able
thanks to the ARM multiple register instructions.

Unfortunatelly the relocation is done without the cache enabled,
so it takes some time, but zeroing the BSS memory takes much more
longer, especially for the configs with big static buffers.

A quick test confirms, that the boot time improvement after using
the arch memcpy for relocation has no significant meaning.
The same test confirms that enable the memset for zeroing BSS,
reduces the boot time.

So this patch enables the arch memset for zeroing the BSS after
the relocation process. For ARM boards, this can be enabled
in board configs by defining: 'CONFIG_USE_ARCH_MEMSET'.

This was tested on Trats2.
A quick test with trace. Boot time from start to main_loop() entry:
- ~1384ms - before this change
-  ~888ms - after this change

Signed-off-by: Przemyslaw Marczak <p.marczak@samsung.com>
Cc: Albert Aribaud <albert.u.boot@aribaud.net>
Cc: Tom Rini <trini@ti.com>
---
 arch/arm/lib/crt0.S | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

Comments

Przemyslaw Marczak Feb. 16, 2015, 3:21 p.m. UTC | #1
Hello,

On 02/16/2015 04:13 PM, Przemyslaw Marczak wrote:
> For ARM architecture, enable the CONFIG_USE_ARCH_MEMSET/MEMCPY,
> will highly increase the memset/memcpy performance. This is able
> thanks to the ARM multiple register instructions.
>
> Unfortunatelly the relocation is done without the cache enabled,
> so it takes some time, but zeroing the BSS memory takes much more
> longer, especially for the configs with big static buffers.
>
> A quick test confirms, that the boot time improvement after using
> the arch memcpy for relocation has no significant meaning.
> The same test confirms that enable the memset for zeroing BSS,
> reduces the boot time.
>
> So this patch enables the arch memset for zeroing the BSS after
> the relocation process. For ARM boards, this can be enabled
> in board configs by defining: 'CONFIG_USE_ARCH_MEMSET'.
>
> This was tested on Trats2.
> A quick test with trace. Boot time from start to main_loop() entry:
> - ~1384ms - before this change
> -  ~888ms - after this change
>
> Signed-off-by: Przemyslaw Marczak <p.marczak@samsung.com>
> Cc: Albert Aribaud <albert.u.boot@aribaud.net>
> Cc: Tom Rini <trini@ti.com>
> ---
>   arch/arm/lib/crt0.S | 10 +++++++++-
>   1 file changed, 9 insertions(+), 1 deletion(-)
>
> diff --git a/arch/arm/lib/crt0.S b/arch/arm/lib/crt0.S
> index 22df3e5..fab3d2c 100644
> --- a/arch/arm/lib/crt0.S
> +++ b/arch/arm/lib/crt0.S
> @@ -115,14 +115,22 @@ here:
>   	bl	c_runtime_cpu_setup	/* we still call old routine here */
>
>   	ldr	r0, =__bss_start	/* this is auto-relocated! */
> -	ldr	r1, =__bss_end		/* this is auto-relocated! */
>
> +#ifdef CONFIG_USE_ARCH_MEMSET
> +	ldr	r3, =__bss_end		/* this is auto-relocated! */
> +	mov	r1, #0x00000000		/* prepare zero to clear BSS */
> +
> +	subs	r2, r3, r0		/* r2 = memset len */
> +	bl	memset
> +#else
> +	ldr	r1, =__bss_end		/* this is auto-relocated! */
>   	mov	r2, #0x00000000		/* prepare zero to clear BSS */
>
>   clbss_l:cmp	r0, r1			/* while not at end of BSS */
>   	strlo	r2, [r0]		/* clear 32-bit BSS word */
>   	addlo	r0, r0, #4		/* move to next */
>   	blo	clbss_l
> +#endif
>
>   	bl coloured_LED_init
>   	bl red_led_on
>

This commit left unchanged. After boot time test using oscilloscope and 
the clock cycle counter I didn't noticed a time difference in more then 
one ms. In this case I think that insert a duplicated code here, has no 
sense.

Best regards,
Simon Glass Feb. 18, 2015, 4:32 a.m. UTC | #2
Hi Przemyslaw,

On 16 February 2015 at 08:21, Przemyslaw Marczak <p.marczak@samsung.com> wrote:
> Hello,
>
>
> On 02/16/2015 04:13 PM, Przemyslaw Marczak wrote:
>>
>> For ARM architecture, enable the CONFIG_USE_ARCH_MEMSET/MEMCPY,
>> will highly increase the memset/memcpy performance. This is able
>> thanks to the ARM multiple register instructions.
>>
>> Unfortunatelly the relocation is done without the cache enabled,
>> so it takes some time, but zeroing the BSS memory takes much more
>> longer, especially for the configs with big static buffers.
>>
>> A quick test confirms, that the boot time improvement after using
>> the arch memcpy for relocation has no significant meaning.
>> The same test confirms that enable the memset for zeroing BSS,
>> reduces the boot time.
>>
>> So this patch enables the arch memset for zeroing the BSS after
>> the relocation process. For ARM boards, this can be enabled
>> in board configs by defining: 'CONFIG_USE_ARCH_MEMSET'.
>>
>> This was tested on Trats2.
>> A quick test with trace. Boot time from start to main_loop() entry:
>> - ~1384ms - before this change
>> -  ~888ms - after this change
>>
>> Signed-off-by: Przemyslaw Marczak <p.marczak@samsung.com>
>> Cc: Albert Aribaud <albert.u.boot@aribaud.net>
>> Cc: Tom Rini <trini@ti.com>
>> ---
>>   arch/arm/lib/crt0.S | 10 +++++++++-
>>   1 file changed, 9 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/arm/lib/crt0.S b/arch/arm/lib/crt0.S
>> index 22df3e5..fab3d2c 100644
>> --- a/arch/arm/lib/crt0.S
>> +++ b/arch/arm/lib/crt0.S
>> @@ -115,14 +115,22 @@ here:
>>         bl      c_runtime_cpu_setup     /* we still call old routine here
>> */
>>
>>         ldr     r0, =__bss_start        /* this is auto-relocated! */
>> -       ldr     r1, =__bss_end          /* this is auto-relocated! */
>>
>> +#ifdef CONFIG_USE_ARCH_MEMSET
>> +       ldr     r3, =__bss_end          /* this is auto-relocated! */
>> +       mov     r1, #0x00000000         /* prepare zero to clear BSS */
>> +
>> +       subs    r2, r3, r0              /* r2 = memset len */
>> +       bl      memset
>> +#else
>> +       ldr     r1, =__bss_end          /* this is auto-relocated! */
>>         mov     r2, #0x00000000         /* prepare zero to clear BSS */
>>
>>   clbss_l:cmp   r0, r1                  /* while not at end of BSS */
>>         strlo   r2, [r0]                /* clear 32-bit BSS word */
>>         addlo   r0, r0, #4              /* move to next */
>>         blo     clbss_l
>> +#endif
>>
>>         bl coloured_LED_init
>>         bl red_led_on
>>
>
> This commit left unchanged. After boot time test using oscilloscope and the
> clock cycle counter I didn't noticed a time difference in more then one ms.
> In this case I think that insert a duplicated code here, has no sense.

I don't understand this comment, sorry.

Regards,
Simon
Przemyslaw Marczak Feb. 18, 2015, 12:31 p.m. UTC | #3
Hello,

On 02/18/2015 05:32 AM, Simon Glass wrote:
> Hi Przemyslaw,
>
> On 16 February 2015 at 08:21, Przemyslaw Marczak <p.marczak@samsung.com> wrote:
>> Hello,
>>
>>
>> On 02/16/2015 04:13 PM, Przemyslaw Marczak wrote:
>>>
>>> For ARM architecture, enable the CONFIG_USE_ARCH_MEMSET/MEMCPY,
>>> will highly increase the memset/memcpy performance. This is able
>>> thanks to the ARM multiple register instructions.
>>>
>>> Unfortunatelly the relocation is done without the cache enabled,
>>> so it takes some time, but zeroing the BSS memory takes much more
>>> longer, especially for the configs with big static buffers.
>>>
>>> A quick test confirms, that the boot time improvement after using
>>> the arch memcpy for relocation has no significant meaning.
>>> The same test confirms that enable the memset for zeroing BSS,
>>> reduces the boot time.
>>>
>>> So this patch enables the arch memset for zeroing the BSS after
>>> the relocation process. For ARM boards, this can be enabled
>>> in board configs by defining: 'CONFIG_USE_ARCH_MEMSET'.
>>>
>>> This was tested on Trats2.
>>> A quick test with trace. Boot time from start to main_loop() entry:
>>> - ~1384ms - before this change
>>> -  ~888ms - after this change
>>>
>>> Signed-off-by: Przemyslaw Marczak <p.marczak@samsung.com>
>>> Cc: Albert Aribaud <albert.u.boot@aribaud.net>
>>> Cc: Tom Rini <trini@ti.com>
>>> ---
>>>    arch/arm/lib/crt0.S | 10 +++++++++-
>>>    1 file changed, 9 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/arch/arm/lib/crt0.S b/arch/arm/lib/crt0.S
>>> index 22df3e5..fab3d2c 100644
>>> --- a/arch/arm/lib/crt0.S
>>> +++ b/arch/arm/lib/crt0.S
>>> @@ -115,14 +115,22 @@ here:
>>>          bl      c_runtime_cpu_setup     /* we still call old routine here
>>> */
>>>
>>>          ldr     r0, =__bss_start        /* this is auto-relocated! */
>>> -       ldr     r1, =__bss_end          /* this is auto-relocated! */
>>>
>>> +#ifdef CONFIG_USE_ARCH_MEMSET
>>> +       ldr     r3, =__bss_end          /* this is auto-relocated! */
>>> +       mov     r1, #0x00000000         /* prepare zero to clear BSS */
>>> +
>>> +       subs    r2, r3, r0              /* r2 = memset len */
>>> +       bl      memset
>>> +#else
>>> +       ldr     r1, =__bss_end          /* this is auto-relocated! */
>>>          mov     r2, #0x00000000         /* prepare zero to clear BSS */
>>>
>>>    clbss_l:cmp   r0, r1                  /* while not at end of BSS */
>>>          strlo   r2, [r0]                /* clear 32-bit BSS word */
>>>          addlo   r0, r0, #4              /* move to next */
>>>          blo     clbss_l
>>> +#endif
>>>
>>>          bl coloured_LED_init
>>>          bl red_led_on
>>>
>>
>> This commit left unchanged. After boot time test using oscilloscope and the
>> clock cycle counter I didn't noticed a time difference in more then one ms.
>> In this case I think that insert a duplicated code here, has no sense.
>
> I don't understand this comment, sorry.
>
> Regards,
> Simon
>

Sorry for the misleading message.
When I send this patch set, I forgot about adding the message-id of the 
previous thread as "in-reply-to".

There was a discussion about insert the memory zeroing routines as an 
asm here, instead of using the 'memset' call. But I tested that there is 
no difference in the performance. So in this case, it's better to use 
the common lib and this commit is the same as it was in the first version.

(I missed the changelog)

Best regards,
Simon Glass Feb. 19, 2015, 6:59 p.m. UTC | #4
On 18 February 2015 at 05:31, Przemyslaw Marczak <p.marczak@samsung.com> wrote:
> Hello,
>
>
> On 02/18/2015 05:32 AM, Simon Glass wrote:
>>
>> Hi Przemyslaw,
>>
>> On 16 February 2015 at 08:21, Przemyslaw Marczak <p.marczak@samsung.com>
>> wrote:
>>>
>>> Hello,
>>>
>>>
>>> On 02/16/2015 04:13 PM, Przemyslaw Marczak wrote:
>>>>
>>>>
>>>> For ARM architecture, enable the CONFIG_USE_ARCH_MEMSET/MEMCPY,
>>>> will highly increase the memset/memcpy performance. This is able
>>>> thanks to the ARM multiple register instructions.
>>>>
>>>> Unfortunatelly the relocation is done without the cache enabled,
>>>> so it takes some time, but zeroing the BSS memory takes much more
>>>> longer, especially for the configs with big static buffers.
>>>>
>>>> A quick test confirms, that the boot time improvement after using
>>>> the arch memcpy for relocation has no significant meaning.
>>>> The same test confirms that enable the memset for zeroing BSS,
>>>> reduces the boot time.
>>>>
>>>> So this patch enables the arch memset for zeroing the BSS after
>>>> the relocation process. For ARM boards, this can be enabled
>>>> in board configs by defining: 'CONFIG_USE_ARCH_MEMSET'.
>>>>
>>>> This was tested on Trats2.
>>>> A quick test with trace. Boot time from start to main_loop() entry:
>>>> - ~1384ms - before this change
>>>> -  ~888ms - after this change
>>>>
>>>> Signed-off-by: Przemyslaw Marczak <p.marczak@samsung.com>
>>>> Cc: Albert Aribaud <albert.u.boot@aribaud.net>
>>>> Cc: Tom Rini <trini@ti.com>
>>>> ---
>>>>    arch/arm/lib/crt0.S | 10 +++++++++-
>>>>    1 file changed, 9 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/arch/arm/lib/crt0.S b/arch/arm/lib/crt0.S
>>>> index 22df3e5..fab3d2c 100644
>>>> --- a/arch/arm/lib/crt0.S
>>>> +++ b/arch/arm/lib/crt0.S
>>>> @@ -115,14 +115,22 @@ here:
>>>>          bl      c_runtime_cpu_setup     /* we still call old routine
>>>> here
>>>> */
>>>>
>>>>          ldr     r0, =__bss_start        /* this is auto-relocated! */
>>>> -       ldr     r1, =__bss_end          /* this is auto-relocated! */
>>>>
>>>> +#ifdef CONFIG_USE_ARCH_MEMSET
>>>> +       ldr     r3, =__bss_end          /* this is auto-relocated! */
>>>> +       mov     r1, #0x00000000         /* prepare zero to clear BSS */
>>>> +
>>>> +       subs    r2, r3, r0              /* r2 = memset len */
>>>> +       bl      memset
>>>> +#else
>>>> +       ldr     r1, =__bss_end          /* this is auto-relocated! */
>>>>          mov     r2, #0x00000000         /* prepare zero to clear BSS */
>>>>
>>>>    clbss_l:cmp   r0, r1                  /* while not at end of BSS */
>>>>          strlo   r2, [r0]                /* clear 32-bit BSS word */
>>>>          addlo   r0, r0, #4              /* move to next */
>>>>          blo     clbss_l
>>>> +#endif
>>>>
>>>>          bl coloured_LED_init
>>>>          bl red_led_on
>>>>
>>>
>>> This commit left unchanged. After boot time test using oscilloscope and
>>> the
>>> clock cycle counter I didn't noticed a time difference in more then one
>>> ms.
>>> In this case I think that insert a duplicated code here, has no sense.
>>
>>
>> I don't understand this comment, sorry.
>>
>> Regards,
>> Simon
>>
>
> Sorry for the misleading message.
> When I send this patch set, I forgot about adding the message-id of the
> previous thread as "in-reply-to".
>
> There was a discussion about insert the memory zeroing routines as an asm
> here, instead of using the 'memset' call. But I tested that there is no
> difference in the performance. So in this case, it's better to use the
> common lib and this commit is the same as it was in the first version.
>
> (I missed the changelog)

I see, thanks.

Reviewed-by: Simon Glass <sjg@chromium.org>
diff mbox

Patch

diff --git a/arch/arm/lib/crt0.S b/arch/arm/lib/crt0.S
index 22df3e5..fab3d2c 100644
--- a/arch/arm/lib/crt0.S
+++ b/arch/arm/lib/crt0.S
@@ -115,14 +115,22 @@  here:
 	bl	c_runtime_cpu_setup	/* we still call old routine here */
 
 	ldr	r0, =__bss_start	/* this is auto-relocated! */
-	ldr	r1, =__bss_end		/* this is auto-relocated! */
 
+#ifdef CONFIG_USE_ARCH_MEMSET
+	ldr	r3, =__bss_end		/* this is auto-relocated! */
+	mov	r1, #0x00000000		/* prepare zero to clear BSS */
+
+	subs	r2, r3, r0		/* r2 = memset len */
+	bl	memset
+#else
+	ldr	r1, =__bss_end		/* this is auto-relocated! */
 	mov	r2, #0x00000000		/* prepare zero to clear BSS */
 
 clbss_l:cmp	r0, r1			/* while not at end of BSS */
 	strlo	r2, [r0]		/* clear 32-bit BSS word */
 	addlo	r0, r0, #4		/* move to next */
 	blo	clbss_l
+#endif
 
 	bl coloured_LED_init
 	bl red_led_on