
[3/3] ipc_wait_for_complete: wait a bit if the first status we get was IDLE

Message ID: 20211119090401.664993-3-dominique.martinet@atmark-techno.com
State: Changes Requested
Series: [1/3] fix missing field initializer warnings

Commit Message

Dominique MARTINET Nov. 19, 2021, 9:04 a.m. UTC
It's possible that swupdate_async_thread finishes writing the whole image and
starts waiting for the install to finish before the install has actually
started (especially if the image is small and with -i).

In this case, the initial get-status call returns IDLE and is handled as if
the update had already finished, so swupdate exits after a 1s wait: because
the initial status (IDLE) equals the returned status, the loop sleeps once
more and then exits for no apparent reason, 1s after an update that seemed
to start normally.

Normally the notify() calls build a backlog and we could assume that we will
eventually get a non-IDLE state here, but if another process checks the IPC
first it's possible that all the notifications have already been consumed, so
we cannot rely on that: only wait once, and unless something really bad
happened the update should really have started after 1s...

Signed-off-by: Dominique Martinet <dominique.martinet@atmark-techno.com>
---

I'm happy to do something else here and only added this delay as a
stopgap: if you have a better idea I'm all ears.

In particular I've thought of extending the status with a loop iteration
or something, and check before install and in wait that the loop
iteration changed, but even that might not be robust if another process
raced us (although that's quite unlikely in the swupdate -i case...),
but changing the protocol is probably overkill.


Slightly unrelated (it wouldn't help here), I've been annoyed by this
loop polling every second, so having something like callbacks to reply
to get status only when status changes would be something I'd like
doing eventually

Thanks!
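To make the race concrete, the first iteration of the current loop goes
roughly like this when the installer has not started yet (a simplified
sketch, not the real code: get_current_status() is a stand-in for the actual
status query at the top of the loop, and status starts out as IDLE):

#include <string.h>
#include <unistd.h>
#include <network_ipc.h>	/* ipc_message, RECOVERY_STATUS, getstatus */

/* simplified sketch of the unpatched ipc_wait_for_complete() loop */
int wait_sketch(getstatus callback)
{
	ipc_message message;
	RECOVERY_STATUS status = IDLE;	/* initial value before the loop */

	message.data.status.last_result = FAILURE;

	do {
		/* stand-in for the real status query; the installer has not
		 * started yet, so it still reports IDLE */
		if (get_current_status(&message) < 0)
			break;

		if (status != (RECOVERY_STATUS)message.data.status.current ||
		    strlen(message.data.status.desc)) {
			if (callback)
				callback(&message);
		} else {
			sleep(1);	/* IDLE == IDLE and no desc: sleep 1s... */
		}

		status = (RECOVERY_STATUS)message.data.status.current;
	} while (message.data.status.current != IDLE);	/* ...and exit: still IDLE */

	return message.data.status.last_result;
}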

 ipc/network_ipc.c | 27 ++++++++++++++++++++-------
 1 file changed, 20 insertions(+), 7 deletions(-)

Comments

Stefano Babic Nov. 21, 2021, 10:51 a.m. UTC | #1
Hi Dominique,

On 19.11.21 10:04, Dominique Martinet wrote:
> It's possible swupdate_async_thread finishes writing the whole image and
> starts waiting for install to finish before the install actually started
> (especially if image is small and with -i)
> 
> In this case, the initial get status call will return IDLE and be handled
> as if the update was already finished, and exit after 1s wait (because
> the initial status = IDLE == returned status, the loop also waits and
> swupdate exits for seemingly no reason 1s after the update started
> seemingly normally)

True, but this is what led to creating a different interface (the progress
interface), which is meant to overcome this limitation. The progress interface
copies and sends notifications to all registered listeners. Via the progress
interface, a separate process does not miss a state transition.
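
For reference, a minimal listener on the progress interface looks roughly
like this (a sketch only, assuming the API declared in include/progress_ipc.h;
error handling and reconnection are omitted):

#include <stdio.h>
#include <swupdate_status.h>	/* RECOVERY_STATUS values (SUCCESS, FAILURE) */
#include <progress_ipc.h>	/* progress_ipc_connect(), progress_ipc_receive() */

int main(void)
{
	struct progress_msg msg;
	int fd = progress_ipc_connect(true);	/* true: retry until the socket is up */

	if (fd < 0)
		return 1;

	/* every state change is pushed to this socket: no polling, and no
	 * missed transitions even if other listeners are connected */
	while (progress_ipc_receive(&fd, &msg) > 0) {
		printf("status %d, step %d of %d\n",
		       (int)msg.status, (int)msg.cur_step, (int)msg.nsteps);
		if (msg.status == SUCCESS || msg.status == FAILURE)
			break;
	}
	return 0;
}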

> 
> Normally the notify() calls will build a backlog and we could assume that
> we will eventually get a non-IDLE state here, but if another process
> checks the ipc first it's possible all the notifications have been
> consumed and we cannot rely on that:

True, but see the checks above.

> only wait once, and unless
> something really bad happened hopefully after 1s the update will
> really have started...
> 
> Signed-off-by: Dominique Martinet <dominique.martinet@atmark-techno.com>
> ---
> 
> I'm happy to do something else here and only added this delay as a
> stopgap: if you have a better idea I'm all ears.

Adding delays is only a last resort; it is better to synchronize.

> 
> In particular I've thought of extending the status with a loop iteration
> or something,

The thing is that, to drop the initial limitation (status retrieved by the
caller process), I introduced the progress interface, and that is what should
be used with multiple processes. There should not be any race.

For swupdate_async_thread() the end() callback is used. This should also
guarantee that the caller does not miss the update's result.
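
For illustration, the end() callback is just a function that receives the
final result; a minimal sketch (assuming the terminated callback type from
include/network_ipc.h):

#include <stdio.h>
#include <network_ipc.h>	/* swupdate_async_start(), RECOVERY_STATUS */

/* end() callback: invoked once by the async thread with the final result */
static int end_cb(RECOVERY_STATUS status)
{
	printf("update finished: %s\n",
	       status == SUCCESS ? "success" : "failure");
	return status == SUCCESS ? 0 : -1;
}

It is passed to swupdate_async_start() together with the write and
get-status callbacks.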

> and check before install and in wait that the loop
> iteration changed, but even that might not be robust if another process
> raced us (although that's quite unlikely in the swupdate -i case...),
> but changing the protocol is probably overkill.
> 
> 
> Slightly unrelated (it wouldn't help here), I've been annoyed by this
> loop polling every second, so having something like callbacks to reply
> to get status

....this is the progress interface, it is event-triggered.

> only when status changes would be something I'd like
> doing eventually
> 
> Thanks!
>

Best regards,
Stefano Babic


>   ipc/network_ipc.c | 27 ++++++++++++++++++++-------
>   1 file changed, 20 insertions(+), 7 deletions(-)
> 
> diff --git a/ipc/network_ipc.c b/ipc/network_ipc.c
> index 5ccc18b6c6d4..0fa585e69323 100644
> --- a/ipc/network_ipc.c
> +++ b/ipc/network_ipc.c
> @@ -338,15 +338,28 @@ int ipc_wait_for_complete(getstatus callback)
>   			break;
>   		}
>   
> -		if (( (status != (RECOVERY_STATUS)message.data.status.current) ||
> -			strlen(message.data.status.desc))) {
> -				if (callback)
> -					callback(&message);
> -			} else
> -				sleep(1);
> +		if (status == IDLE && message.data.status.current == IDLE) {
> +			/* if we get idle state on first iteration (status == IDLE),
> +			 * then either the update hasn't started yet or is already
> +			 * finished and someone else already consumed all the
> +			 * notifications: wait 1s and retry just once in case
> +			 * it's the former. */
> +			sleep(1);
> +			/* use some invalid state to not skip next callback... */
> +			status = -1;
> +			continue;
> +		}
> +
> +		if (status != (RECOVERY_STATUS)message.data.status.current ||
> +		    strlen(message.data.status.desc)) {
> +			if (callback)
> +				callback(&message);
> +		} else {
> +			sleep(1);
> +		}
>   
>   		status = (RECOVERY_STATUS)message.data.status.current;
> -	} while(message.data.status.current != IDLE);
> +	} while (message.data.status.current != IDLE);
>   
>   	return message.data.status.last_result;
>   }
>
Dominique MARTINET Nov. 21, 2021, 11:27 p.m. UTC | #2
Stefano Babic wrote on Sun, Nov 21, 2021 at 11:51:38AM +0100:
> > In this case, the initial get status call will return IDLE and be handled
> > as if the update was already finished, and exit after 1s wait (because
> > the initial status = IDLE == returned status, the loop also waits and
> > swupdate exits for seemingly no reason 1s after the update started
> > seemingly normally)
> 
> True, but this is what led to creating a different interface (the progress
> interface), which is meant to overcome this limitation. The progress interface
> copies and sends notifications to all registered listeners. Via the progress
> interface, a separate process does not miss a state transition.

> > In particular I've thought of extending the status with a loop iteration
> > or something,
> 
> The thing is that, to drop the initial limitation (status retrieved by the
> caller process), I introduced the progress interface, and that is what should
> be used with multiple processes. There should not be any race.
> 
> For swupdate_async_thread() the end() callback is used. This should also
> guarantee that the caller does not miss the update's result.

Hm, that end() callback is unrelated: the problem is that the end()
callback is called too early, because ipc_wait_for_complete() returned
as the state was IDLE before the update actually started.

> > Slightly unrelated (it wouldn't help here), I've been annoyed by this
> > loop polling every second, so having something like callbacks to reply
> > to get status
> 
> ....this is the progress interface, it is event-triggered.

Ah, I was focused on the state itself but this was available for
notify() calls themselves, I see it now.

In this case, the best way I can see would be to register a progress
socket at the start of swupdate_async_thread, consume it in each loop
iteration to avoid blocking if the pipe is full, and wait for it to
finish after the final write.

I'll send a v2 today that does this.
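
Roughly something like the following (only a sketch of the idea, not the v2
patch itself; the names come from include/progress_ipc.h and the existing
write loop is elided):

#include <progress_ipc.h>	/* progress_ipc_connect(), progress_ipc_receive() */

/* progfd is opened with progress_ipc_connect() before streaming the image
 * and drained (non-blocking) on every write-loop iteration so the socket
 * never fills up; after the final write we block here instead of polling
 * the status every second. */
static RECOVERY_STATUS wait_on_progress(int progfd)
{
	struct progress_msg msg;

	while (progress_ipc_receive(&progfd, &msg) > 0) {
		if (msg.status == SUCCESS || msg.status == FAILURE)
			return msg.status;	/* terminal state reached */
	}
	return FAILURE;	/* connection dropped before a result arrived */
}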


I also see three code paths that do channel->get_file() followed by
ipc_wait_for_complete() which probably suffer from the same race
(in corelib/downloader.c, suricatta/server_general.c, and
suricatta/server_hawkbit.c) -- it might actually make sense to update
these as well... I'll see if I can reproduce with -d and a very small
file as well.
Stefano Babic Nov. 22, 2021, 9:58 a.m. UTC | #3
Hi Dominique,

On 22.11.21 00:27, Dominique Martinet wrote:
> Stefano Babic wrote on Sun, Nov 21, 2021 at 11:51:38AM +0100:
>>> In this case, the initial get status call will return IDLE and be handled
>>> as if the update was already finished, and exit after 1s wait (because
>>> the initial status = IDLE == returned status, the loop also waits and
>>> swupdate exits for seemingly no reason 1s after the update started
>>> seemingly normally)
>>
>> True, but this is what led to creating a different interface (the progress
>> interface), which is meant to overcome this limitation. The progress interface
>> copies and sends notifications to all registered listeners. Via the progress
>> interface, a separate process does not miss a state transition.
> 
>>> In particular I've thought of extending the status with a loop iteration
>>> or something,
>>
>> The thing is that, to drop the initial limitation (status retrieved by the
>> caller process), I introduced the progress interface, and that is what should
>> be used with multiple processes. There should not be any race.
>>
>> For swupdate_async_thread() the end() callback is used. This should also
>> guarantee that the caller does not miss the update's result.
> 
> Hm, that end() callback is unrelated: the problem is that the end()
> callback is called too early, because ipc_wait_for_complete() returned
> as the state was IDLE before the update actually started.
> 

I got the point and I see the race condition, true.

>>> Slightly unrelated (it wouldn't help here), I've been annoyed by this
>>> loop polling every second, so having something like callbacks to reply
>>> to get status
>>
>> ....this is the progress interface, it is event-triggered.
> 
> Ah, I was focused on the state itself but this was available for
> notify() calls themselves, I see it now.
> 
> In this case, the best way I can see would be to register a progress
> socket at the start of swupdate_async_thread, consume it in each loop
> iteration to avoid blocking if the pipe is full, and wait for it to
> finish after the final write.
> 
> I'll send a v2 today that does this.
> 
> 
> I also see three code paths that do channel->get_file() followed by
> ipc_wait_for_complete() which probably suffer from the same race
> (in corelib/downloader.c, suricatta/server_general.c, and
> suricatta/server_hawkbit.c) -- it might actually make sense to update
> these as well... I'll see if I can reproduce with -d and a very small
> file as well.
> 

Regards,
Stefano

Patch

diff --git a/ipc/network_ipc.c b/ipc/network_ipc.c
index 5ccc18b6c6d4..0fa585e69323 100644
--- a/ipc/network_ipc.c
+++ b/ipc/network_ipc.c
@@ -338,15 +338,28 @@ int ipc_wait_for_complete(getstatus callback)
 			break;
 		}
 
-		if (( (status != (RECOVERY_STATUS)message.data.status.current) ||
-			strlen(message.data.status.desc))) {
-				if (callback)
-					callback(&message);
-			} else
-				sleep(1);
+		if (status == IDLE && message.data.status.current == IDLE) {
+			/* if we get idle state on first iteration (status == IDLE),
+			 * then either the update hasn't started yet or is already
+			 * finished and someone else already consumed all the
+			 * notifications: wait 1s and retry just once in case
+			 * it's the former. */
+			sleep(1);
+			/* use some invalid state to not skip next callback... */
+			status = -1;
+			continue;
+		}
+
+		if (status != (RECOVERY_STATUS)message.data.status.current ||
+		    strlen(message.data.status.desc)) {
+			if (callback)
+				callback(&message);
+		} else {
+			sleep(1);
+		}
 
 		status = (RECOVERY_STATUS)message.data.status.current;
-	} while(message.data.status.current != IDLE);
+	} while (message.data.status.current != IDLE);
 
 	return message.data.status.last_result;
 }