diff mbox series

lzma: Fix decompression speed regression

Message ID 401a0fd05a7d601a5b8b0a250ad70608a42ed213.1688546052.git.christophe.leroy@csgroup.eu
State Accepted
Commit ad47974707da5377bab9011a887ec8e92f2b7ddb
Delegated to: Tom Rini
Headers show
Series lzma: Fix decompression speed regression | expand

Commit Message

Christophe Leroy July 5, 2023, 8:34 a.m. UTC
Uncompressing a 1.7Mbytes FIT image on U-boot 2023.04 takes
approx 7s on a powerpc 8xx.
The same on U-boot 2023.07-rc6 takes approx 28s unless watchdog
is disabled.

During that decompression, LzmaDec_DecodeReal() calls schedule
1.6 million times, that is every 4µs in average.

In the past it used to be a call to WATCHDOG_RESET() which was
just calling hw_watchdog_reset().

But the combination of commit 29caf9305b6 ("cyclic: Use schedule()
instead of WATCHDOG_RESET()") and commit 26e8ebcd7cb ("watchdog:
mpc8xxx: Make it generic") results in an heavier processing.

However, there is absolutely no point in calling schedule() that
often.

By moving and keeping only one call to schedule() in the main
loop the number of calls is reduced to 1.2 million which is still
too much. So add logic to only call schedule every 1024 times.
That leads to a call to schedule approx every 6ms which is still
far enough to entertain the watchdog which has a 1s timeout on
powerpc 8xx.

powerpc 8xx being one of the slowest targets we have today in
U-boot, and most other watchdogs having a timeout of one minutes
instead of one second like the 8xx, this fix should not have
negative impact on other targets.

Fixes: 29caf9305b6 ("cyclic: Use schedule() instead of WATCHDOG_RESET()")
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
---
 lib/lzma/LzmaDec.c | 18 ++++--------------
 1 file changed, 4 insertions(+), 14 deletions(-)

Comments

Simon Glass July 6, 2023, 3:57 p.m. UTC | #1
On Wed, 5 Jul 2023 at 09:54, Christophe Leroy
<christophe.leroy@csgroup.eu> wrote:
>
> Uncompressing a 1.7Mbytes FIT image on U-boot 2023.04 takes
> approx 7s on a powerpc 8xx.
> The same on U-boot 2023.07-rc6 takes approx 28s unless watchdog
> is disabled.
>
> During that decompression, LzmaDec_DecodeReal() calls schedule
> 1.6 million times, that is every 4µs in average.
>
> In the past it used to be a call to WATCHDOG_RESET() which was
> just calling hw_watchdog_reset().
>
> But the combination of commit 29caf9305b6 ("cyclic: Use schedule()
> instead of WATCHDOG_RESET()") and commit 26e8ebcd7cb ("watchdog:
> mpc8xxx: Make it generic") results in an heavier processing.
>
> However, there is absolutely no point in calling schedule() that
> often.
>
> By moving and keeping only one call to schedule() in the main
> loop the number of calls is reduced to 1.2 million which is still
> too much. So add logic to only call schedule every 1024 times.
> That leads to a call to schedule approx every 6ms which is still
> far enough to entertain the watchdog which has a 1s timeout on
> powerpc 8xx.
>
> powerpc 8xx being one of the slowest targets we have today in
> U-boot, and most other watchdogs having a timeout of one minutes
> instead of one second like the 8xx, this fix should not have
> negative impact on other targets.
>
> Fixes: 29caf9305b6 ("cyclic: Use schedule() instead of WATCHDOG_RESET()")
> Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
> ---
>  lib/lzma/LzmaDec.c | 18 ++++--------------
>  1 file changed, 4 insertions(+), 14 deletions(-)

Reviewed-by: Simon Glass <sjg@chromium.org>

+Stefan Roese I wonder if we need a more general fix?
Stefan Roese July 7, 2023, 9:09 a.m. UTC | #2
Hi Simon,
Hi Christophe,

On 7/6/23 17:57, Simon Glass wrote:
> On Wed, 5 Jul 2023 at 09:54, Christophe Leroy
> <christophe.leroy@csgroup.eu> wrote:
>>
>> Uncompressing a 1.7Mbytes FIT image on U-boot 2023.04 takes
>> approx 7s on a powerpc 8xx.
>> The same on U-boot 2023.07-rc6 takes approx 28s unless watchdog
>> is disabled.
>>
>> During that decompression, LzmaDec_DecodeReal() calls schedule
>> 1.6 million times, that is every 4µs in average.
>>
>> In the past it used to be a call to WATCHDOG_RESET() which was
>> just calling hw_watchdog_reset().
>>
>> But the combination of commit 29caf9305b6 ("cyclic: Use schedule()
>> instead of WATCHDOG_RESET()") and commit 26e8ebcd7cb ("watchdog:
>> mpc8xxx: Make it generic") results in an heavier processing.
>>
>> However, there is absolutely no point in calling schedule() that
>> often.
>>
>> By moving and keeping only one call to schedule() in the main
>> loop the number of calls is reduced to 1.2 million which is still
>> too much. So add logic to only call schedule every 1024 times.
>> That leads to a call to schedule approx every 6ms which is still
>> far enough to entertain the watchdog which has a 1s timeout on
>> powerpc 8xx.
>>
>> powerpc 8xx being one of the slowest targets we have today in
>> U-boot, and most other watchdogs having a timeout of one minutes
>> instead of one second like the 8xx, this fix should not have
>> negative impact on other targets.
>>
>> Fixes: 29caf9305b6 ("cyclic: Use schedule() instead of WATCHDOG_RESET()")
>> Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
>> ---
>>   lib/lzma/LzmaDec.c | 18 ++++--------------
>>   1 file changed, 4 insertions(+), 14 deletions(-)
> 
> Reviewed-by: Simon Glass <sjg@chromium.org>
> 
> +Stefan Roese I wonder if we need a more general fix?

IIRC, we already have some code in the watchdog IF making sure, that the
WDT reset is not called too often. I need to re-check here. And yes, we
should probably either move this code to the common schedule() function
or add it here. Best by adding a new Kconfig option, configuring the max
schedule frequency, e.g. 1000 Hz.

Thanks,
Stefan
Christophe Leroy July 7, 2023, 9:27 a.m. UTC | #3
Le 07/07/2023 à 11:09, Stefan Roese a écrit :
> [Vous ne recevez pas souvent de courriers de sr@denx.de. Découvrez 
> pourquoi ceci est important à 
> https://aka.ms/LearnAboutSenderIdentification ]
> 
> Hi Simon,
> Hi Christophe,
> 
> On 7/6/23 17:57, Simon Glass wrote:
>> On Wed, 5 Jul 2023 at 09:54, Christophe Leroy
>> <christophe.leroy@csgroup.eu> wrote:
>>>
>>> Uncompressing a 1.7Mbytes FIT image on U-boot 2023.04 takes
>>> approx 7s on a powerpc 8xx.
>>> The same on U-boot 2023.07-rc6 takes approx 28s unless watchdog
>>> is disabled.
>>>
>>> During that decompression, LzmaDec_DecodeReal() calls schedule
>>> 1.6 million times, that is every 4µs in average.
>>>
>>> In the past it used to be a call to WATCHDOG_RESET() which was
>>> just calling hw_watchdog_reset().
>>>
>>> But the combination of commit 29caf9305b6 ("cyclic: Use schedule()
>>> instead of WATCHDOG_RESET()") and commit 26e8ebcd7cb ("watchdog:
>>> mpc8xxx: Make it generic") results in an heavier processing.
>>>
>>> However, there is absolutely no point in calling schedule() that
>>> often.
>>>
>>> By moving and keeping only one call to schedule() in the main
>>> loop the number of calls is reduced to 1.2 million which is still
>>> too much. So add logic to only call schedule every 1024 times.
>>> That leads to a call to schedule approx every 6ms which is still
>>> far enough to entertain the watchdog which has a 1s timeout on
>>> powerpc 8xx.
>>>
>>> powerpc 8xx being one of the slowest targets we have today in
>>> U-boot, and most other watchdogs having a timeout of one minutes
>>> instead of one second like the 8xx, this fix should not have
>>> negative impact on other targets.
>>>
>>> Fixes: 29caf9305b6 ("cyclic: Use schedule() instead of 
>>> WATCHDOG_RESET()")
>>> Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
>>> ---
>>>   lib/lzma/LzmaDec.c | 18 ++++--------------
>>>   1 file changed, 4 insertions(+), 14 deletions(-)
>>
>> Reviewed-by: Simon Glass <sjg@chromium.org>
>>
>> +Stefan Roese I wonder if we need a more general fix?
> 
> IIRC, we already have some code in the watchdog IF making sure, that the
> WDT reset is not called too often. I need to re-check here. And yes, we
> should probably either move this code to the common schedule() function
> or add it here. Best by adding a new Kconfig option, configuring the max
> schedule frequency, e.g. 1000 Hz.

Yes we may add a ratelimitation inside schedule() but if we only do that 
it will only be a plaster. It will still use a huge amount of CPU just 
for reading the time and checking the limit.

A caller shouldn't call schedule() million of times per second, I think 
the root cause need to be fixed anyway. LZMA has this problem, ZLIB has 
not. I didn't check other decompressors.

The best would be that schedule() throw a WARN_ONCE() when that happens 
so that we can identify frantic callers.

Christophe
Stefan Roese July 7, 2023, 9:31 a.m. UTC | #4
Hi Christophe,

On 7/7/23 11:27, Christophe Leroy wrote:
> 
> 
> Le 07/07/2023 à 11:09, Stefan Roese a écrit :
>> [Vous ne recevez pas souvent de courriers de sr@denx.de. Découvrez
>> pourquoi ceci est important à
>> https://aka.ms/LearnAboutSenderIdentification ]
>>
>> Hi Simon,
>> Hi Christophe,
>>
>> On 7/6/23 17:57, Simon Glass wrote:
>>> On Wed, 5 Jul 2023 at 09:54, Christophe Leroy
>>> <christophe.leroy@csgroup.eu> wrote:
>>>>
>>>> Uncompressing a 1.7Mbytes FIT image on U-boot 2023.04 takes
>>>> approx 7s on a powerpc 8xx.
>>>> The same on U-boot 2023.07-rc6 takes approx 28s unless watchdog
>>>> is disabled.
>>>>
>>>> During that decompression, LzmaDec_DecodeReal() calls schedule
>>>> 1.6 million times, that is every 4µs in average.
>>>>
>>>> In the past it used to be a call to WATCHDOG_RESET() which was
>>>> just calling hw_watchdog_reset().
>>>>
>>>> But the combination of commit 29caf9305b6 ("cyclic: Use schedule()
>>>> instead of WATCHDOG_RESET()") and commit 26e8ebcd7cb ("watchdog:
>>>> mpc8xxx: Make it generic") results in an heavier processing.
>>>>
>>>> However, there is absolutely no point in calling schedule() that
>>>> often.
>>>>
>>>> By moving and keeping only one call to schedule() in the main
>>>> loop the number of calls is reduced to 1.2 million which is still
>>>> too much. So add logic to only call schedule every 1024 times.
>>>> That leads to a call to schedule approx every 6ms which is still
>>>> far enough to entertain the watchdog which has a 1s timeout on
>>>> powerpc 8xx.
>>>>
>>>> powerpc 8xx being one of the slowest targets we have today in
>>>> U-boot, and most other watchdogs having a timeout of one minutes
>>>> instead of one second like the 8xx, this fix should not have
>>>> negative impact on other targets.
>>>>
>>>> Fixes: 29caf9305b6 ("cyclic: Use schedule() instead of
>>>> WATCHDOG_RESET()")
>>>> Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
>>>> ---
>>>>    lib/lzma/LzmaDec.c | 18 ++++--------------
>>>>    1 file changed, 4 insertions(+), 14 deletions(-)
>>>
>>> Reviewed-by: Simon Glass <sjg@chromium.org>
>>>
>>> +Stefan Roese I wonder if we need a more general fix?
>>
>> IIRC, we already have some code in the watchdog IF making sure, that the
>> WDT reset is not called too often. I need to re-check here. And yes, we
>> should probably either move this code to the common schedule() function
>> or add it here. Best by adding a new Kconfig option, configuring the max
>> schedule frequency, e.g. 1000 Hz.
> 
> Yes we may add a ratelimitation inside schedule() but if we only do that
> it will only be a plaster. It will still use a huge amount of CPU just
> for reading the time and checking the limit.
> 
> A caller shouldn't call schedule() million of times per second, I think
> the root cause need to be fixed anyway. LZMA has this problem, ZLIB has
> not. I didn't check other decompressors.

Agreed. We should do both, change schedule() and also fix the high
frequent callers.

> The best would be that schedule() throw a WARN_ONCE() when that happens
> so that we can identify frantic callers.

Good idea. Let me see if I can come up with such a patch. Will take a
few days though as I'm traveling the next few days.

Thanks,
Stefan
Tom Rini July 15, 2023, 3:03 p.m. UTC | #5
On Wed, Jul 05, 2023 at 10:34:26AM +0200, Christophe Leroy wrote:

> Uncompressing a 1.7Mbytes FIT image on U-boot 2023.04 takes
> approx 7s on a powerpc 8xx.
> The same on U-boot 2023.07-rc6 takes approx 28s unless watchdog
> is disabled.
> 
> During that decompression, LzmaDec_DecodeReal() calls schedule
> 1.6 million times, that is every 4µs in average.
> 
> In the past it used to be a call to WATCHDOG_RESET() which was
> just calling hw_watchdog_reset().
> 
> But the combination of commit 29caf9305b6 ("cyclic: Use schedule()
> instead of WATCHDOG_RESET()") and commit 26e8ebcd7cb ("watchdog:
> mpc8xxx: Make it generic") results in an heavier processing.
> 
> However, there is absolutely no point in calling schedule() that
> often.
> 
> By moving and keeping only one call to schedule() in the main
> loop the number of calls is reduced to 1.2 million which is still
> too much. So add logic to only call schedule every 1024 times.
> That leads to a call to schedule approx every 6ms which is still
> far enough to entertain the watchdog which has a 1s timeout on
> powerpc 8xx.
> 
> powerpc 8xx being one of the slowest targets we have today in
> U-boot, and most other watchdogs having a timeout of one minutes
> instead of one second like the 8xx, this fix should not have
> negative impact on other targets.
> 
> Fixes: 29caf9305b6 ("cyclic: Use schedule() instead of WATCHDOG_RESET()")
> Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
> Reviewed-by: Simon Glass <sjg@chromium.org>

Applied to u-boot/master, thanks!
diff mbox series

Patch

diff --git a/lib/lzma/LzmaDec.c b/lib/lzma/LzmaDec.c
index 341149f766..a90b35c6a9 100644
--- a/lib/lzma/LzmaDec.c
+++ b/lib/lzma/LzmaDec.c
@@ -152,8 +152,7 @@  static int MY_FAST_CALL LzmaDec_DecodeReal(CLzmaDec *p, SizeT limit, const Byte
   const Byte *buf = p->buf;
   UInt32 range = p->range;
   UInt32 code = p->code;
-
-  schedule();
+  unsigned int loop = 0;
 
   do
   {
@@ -162,6 +161,9 @@  static int MY_FAST_CALL LzmaDec_DecodeReal(CLzmaDec *p, SizeT limit, const Byte
     unsigned ttt;
     unsigned posState = processedPos & pbMask;
 
+    if (!(loop++ & 1023))
+	    schedule();
+
     prob = probs + IsMatch + (state << kNumPosBitsMax) + posState;
     IF_BIT_0(prob)
     {
@@ -177,8 +179,6 @@  static int MY_FAST_CALL LzmaDec_DecodeReal(CLzmaDec *p, SizeT limit, const Byte
         state -= (state < 4) ? state : 3;
         symbol = 1;
 
-        schedule();
-
         do { GET_BIT(prob + symbol, symbol) } while (symbol < 0x100);
       }
       else
@@ -188,8 +188,6 @@  static int MY_FAST_CALL LzmaDec_DecodeReal(CLzmaDec *p, SizeT limit, const Byte
         state -= (state < 10) ? 3 : 6;
         symbol = 1;
 
-        schedule();
-
         do
         {
           unsigned bit;
@@ -321,8 +319,6 @@  static int MY_FAST_CALL LzmaDec_DecodeReal(CLzmaDec *p, SizeT limit, const Byte
               UInt32 mask = 1;
               unsigned i = 1;
 
-              schedule();
-
               do
               {
                 GET_BIT2(prob + i, i, ; , distance |= mask);
@@ -335,8 +331,6 @@  static int MY_FAST_CALL LzmaDec_DecodeReal(CLzmaDec *p, SizeT limit, const Byte
           {
             numDirectBits -= kNumAlignBits;
 
-            schedule();
-
             do
             {
               NORMALIZE
@@ -409,8 +403,6 @@  static int MY_FAST_CALL LzmaDec_DecodeReal(CLzmaDec *p, SizeT limit, const Byte
           const Byte *lim = dest + curLen;
           dicPos += curLen;
 
-          schedule();
-
           do
             *(dest) = (Byte)*(dest + src);
           while (++dest != lim);
@@ -418,8 +410,6 @@  static int MY_FAST_CALL LzmaDec_DecodeReal(CLzmaDec *p, SizeT limit, const Byte
         else
         {
 
-          schedule();
-
           do
           {
             dic[dicPos++] = dic[pos];