diff mbox series

[2/2] Check for dead connections on new progress IPC client

Message ID 20210218195157.3175119-1-sava.jakovljev@teufel.de
State Changes Requested
Headers show
Series [1/2] Core: Properly close socket after processing in networking thread | expand

Commit Message

Sava Jakovljev Feb. 18, 2021, 7:51 p.m. UTC
---
 core/progress_thread.c | 38 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 38 insertions(+)

Comments

Stefano Babic Feb. 19, 2021, 7:56 a.m. UTC | #1
Hi Sava,

On 18.02.21 20:51, Sava Jakovljev wrote:
> ---
>   core/progress_thread.c | 38 ++++++++++++++++++++++++++++++++++++++
>   1 file changed, 38 insertions(+)
> 
> diff --git a/core/progress_thread.c b/core/progress_thread.c
> index 1d4facd..8705bfd 100644
> --- a/core/progress_thread.c
> +++ b/core/progress_thread.c
> @@ -56,6 +56,43 @@ struct swupdate_progress {
>   };
>   static struct swupdate_progress progress;
>   
> +static bool is_socket_closed(int sock)
> +{
> +	fd_set fdset;
> +
> +	FD_ZERO(&fdset);
> +	FD_SET(sock, &fdset);
> +
> +	struct timeval tv = {0};
> +	select(sock + 1, &fdset, 0, 0, &tv);
> +
> +	if (!FD_ISSET(sock, &fdset))
> +		return false;
> +
> +	int n = 0;
> +	ioctl(sock, FIONREAD, &n);
> +

Why is the socket considered closed if there is no data to be read? That 
looks unrelated to me.

> +	return n == 0;
> +}
> +
> +static void remove_dead_sockets(struct swupdate_progress* pprog)
> +{
> +	struct progress_conn *conn, *tmp;
> +
> +	SIMPLEQ_FOREACH_SAFE(conn, &pprog->conns, next, tmp)
> +	{
> +		if (is_socket_closed(conn->sockfd))
> +		{
> +			TRACE("A progress client disappeared, removing it.");
> +			close(conn->sockfd);
> +
> +			SIMPLEQ_REMOVE(&pprog->conns, conn,
> +							progress_conn, next);
> +			free(conn);
> +		}
> +	}
> +}
> +
>   /*
>    * This must be called after acquiring the mutex
>    * for the progress structure
> @@ -282,6 +319,7 @@ void *progress_bar_thread (void __attribute__ ((__unused__)) *data)
>   		conn->sockfd = connfd;
>   		pthread_mutex_lock(&pprog->lock);
>   		SIMPLEQ_INSERT_TAIL(&pprog->conns, conn, next);
> +		remove_dead_sockets(pprog);

Why do we need this housekeeping here?

The connections are verified every time a message is sent in 
send_progress_msg(). If a socket is closed, write() returns an error and 
the connection is removed from the list. This check seems arbitrary to me.

Best regards,
Stefano Babic




>   		pthread_mutex_unlock(&pprog->lock);
>   	} while(1);
>   }
>
Sava Jakovljev Feb. 19, 2021, 1:04 p.m. UTC | #2
Hi Stefano,

It's not really necessary, but it deals with the situation where clients 
connect and disconnect and there are no progress messages in between.
We encountered that situation in a test scenario, not a real-world one, but 
in my opinion this kind of resource leak should be handled as soon as 
possible, and not only when we want to send a message.
Also, it is not impossible for the problem to occur in the real world - one 
just has to try hard enough. And yes, it will be resolved when the first 
progress message is sent, but by then it might be a bit too late. For 
example, if one wants to use the control IPC, getting the answer "you have 
to wait until we close our connections" is a no-go - and that is what will 
happen if this problem occurs. I agree it is unlikely - but it is possible.
I don't see much need to periodically check for dead connections, so this 
is a middle ground - if my reasoning is right, we actually prevent any 
resource leakage this way without too much overhead.
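
For reference, the check relies on the standard end-of-file indication for 
stream sockets: if select() reports the socket readable but FIONREAD says 
zero bytes are pending, the peer has closed its end. A minimal, 
self-contained sketch of the same idea (the helper name below is 
illustrative, not part of the patch):

#include <stdbool.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Illustrative helper: detect an orderly close by the peer on a stream socket. */
static bool peer_has_closed(int sock)
{
	char c;
	/* Non-blocking peek: 0 means EOF (peer closed), > 0 means data is
	 * pending, -1 with EAGAIN means the connection is idle but alive. */
	ssize_t n = recv(sock, &c, 1, MSG_PEEK | MSG_DONTWAIT);

	return n == 0;
}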

Thank you.
Best regards,
Sava Jakovljev
Stefano Babic Feb. 19, 2021, 1:25 p.m. UTC | #3
Hi Sava,

On 19.02.21 14:04, Sava Jakovljev wrote:
> Hi Stefano,
> 
> It's not really necessary, but it deals with the situation when clients
> connect and disconnect, and there are no progress messages in between.

The progress IPC is local; there is no way to connect to it from outside
the device. If a client connects and disconnects, it is simply a buggy
application that must be fixed.

> We encountered that situation in a test-scenario, not a real-world one,
> but in my opinion, having this kind of resource leak should be handled
> as soon as possible, and not just when we want to send a message.

This is not a resource leak, because SWUpdate does not really know whether
the connection is still up, and using FIONREAD to detect this is not correct
IMHO.

> Also, it is not impossible for problem to occur in real-world - one just
> has to try hard enough.

It remains a buggy application - if we want to protect against this, we
should use what Linux already offers, for example setting access rights on
the socket.

> And yes, it will be solved on first progress
> message sent, but then it might just be a bit too late.

I do not think so.

> For example, if
> one wants to use control IPC,

The control IPC does not flow into the progress interface; it has its own
connection, where you found the resource leak that you fix with the first
patch.

> having the response of "you have to wait
> until we close our connections" is a no-go

But again, this is caused by the resource leak in [1].

> - and that will happen if
> this problem occurs. I agree, it is unlikely - but it is possible.

This is out of context - you are mixing the progress interface with the
control interface. The progress interface is unidirectional, and data
flows from SWUpdate to the registered client.

> I don't see much need to periodically check for dead connections,

I do not see it at all.

> so
> this is a middle-ground - if my reasoning is right, we are actually
> preventing any resource leakage this way without too much overhead.

I do not think there is a resource leak in progress interface, while
there is clearly in the control interface.

Best regards,
Stefano Babic

Sava Jakovljev Feb. 19, 2021, 1:34 p.m. UTC | #4
Hi Stefano,

Sure, if you want to keep it this way, that's no problem for me - as stated, 
it is mostly about having slightly better clean-up.
I would still argue the clean-up could be done better (to be precise, you 
are right that there is no leak in the strict sense, but one can still get 
into a bad situation - not easily, granted - so the technicality doesn't 
really matter).
For our needs we will keep this kind of clean-up, taking your comments into 
account, but feel free to ignore it if you think that's best.

Thank you.
Best regards,
Sava Jakovljev

Sava Jakovljev Feb. 19, 2021, 4:59 p.m. UTC | #5
Hello,

Just to illustrate better what the concern is and why this is a leak in my 
opinion, even though the sockets are tracked in the swupdate code:
I ran a script which starts swupdate-progress in an infinite loop, waits for 
a short period of time, then exits swupdate-progress and continues the loop. 
After a few minutes swupdate has more than 1024 open fds. This is clearly a 
bug - why shouldn't we check for dead connections when accepting a new one?
As a consequence, the command IPC stops working - Linux forbids swupdate 
from opening any new file. It does not matter at all that the progress IPC 
and the command IPC are separate; they are part of the same process, so I 
don't think this is out of context or that I am mixing any terms. From the 
Linux point of view, the process is simply exceeding its maximum number of 
open files, so it is not allowed to open any more.
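
For illustration, roughly the same effect can be reproduced with a minimal C 
client that connects to the progress socket and disconnects in a loop (a 
sketch only - the socket path below is an assumption and has to match the 
path the SWUpdate build is configured with):

#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <sys/un.h>

#define PROGRESS_SOCKET_PATH "/tmp/swupdateprog"	/* assumed path */

int main(void)
{
	for (int i = 0; i < 2000; i++) {
		struct sockaddr_un addr = { .sun_family = AF_UNIX };
		int fd = socket(AF_UNIX, SOCK_STREAM, 0);

		strncpy(addr.sun_path, PROGRESS_SOCKET_PATH, sizeof(addr.sun_path) - 1);
		if (fd >= 0 && connect(fd, (struct sockaddr *)&addr, sizeof(addr)) == 0)
			usleep(10000);	/* stay connected briefly, like the script does */
		if (fd >= 0)
			close(fd);	/* the client closes, but swupdate keeps its accepted fd */
	}
	return 0;
}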

Waiting for swupdate to send a progress message in order to clean them up is 
not a good solution, in my opinion.
Given that swupdate is a background process which sits idle most of the 
time, and that progress messages are only sent when there is an upgrade, 
clearing dangling fds only at the point when a message is actually sent - 
which is not often, and worse, only when the upgrade status changes - is 
unacceptable, IMHO. I am just trying to make clear that swupdate should be 
robust and stable regardless of buggy clients which could spawn many 
re-connections. A "Too many open files" error is not what I would expect 
from an upgrade service in any scenario.

Regarding my implementation of the check: we can use getsockopt with 
SOL_SOCKET if you prefer that, and if you agree with my reasoning - if I am 
making an error here, please let me know.

Best regards,
Sava Jakovljev
Stefano Babic Feb. 19, 2021, 8:26 p.m. UTC | #6
Hi Sava,

On 19.02.21 17:59, Sava Jakovljev wrote:
> Hello,
> 
> Just to provide a way that represents better what is the concern and why 
> this is a leak in my opinion, even though sockets are saved in swupdate 
> code:
> I ran a script which calls swupdate-progress in an infinite loop, waits 
> for short period of time, then exits swupdate-progress and continue the 
> loop.

It is no mystery what happens...

> After few minutes swupdate has more than 1024 open fd's.

Nothing else expected.

> This is clearly 
> a bug - why shouldn't we check for dead connections when accepting a new 
> one?

You have just proved that your application (the test script) is a mess 
- please consider that the progress interface is a *LOCAL* interface 
based on UDS, and it is not open to the world. It is the duty of the 
integrator to build a suitable system, that is, to estimate how many 
progress clients should be started and to set the rights on the sockets 
accordingly, to avoid local exploits.

> As a consequence, command IPC is not working anymore

Of course

> - Linux forbids 
> swupdate to open any new file

according to limits and/or cgroups, and it is a good thing.

> - it doesn't matter at all that progress 
> IPC and command IPC are separated, they are part of the same process - 
> so I don't think this is out of context or that I am mixing any terms. 
>  From Linux point of view, a process is simply exceeding maximum number 
> of open files so it won't allow it to take any more.

Exactly. But it is not acceptable to generate an arbitrary mess on the 
device and then try to fix it somewhere else. I could also randomly write 
into /dev/mem and then complain that something crashes.

> 
> Waiting for swupdate to send a message to progress IPC to clear them up 
> is not a good solution, in my opinion.

The correct way to check whether a connection is still alive is to use the 
connection, that is, by sending / receiving or by using the keep-alive 
option. But as this is a UDS, keep-alive is not available, and it cannot be 
replaced by a message at the application level (in SWUpdate, that is), 
because the progress interface is designed to be unidirectional (no ACK 
from the other side).

> Giving that swupdate is a background process which sits idle most of the 
> time and that progress IPC is only sent when there is an upgrade, 
> clearing dangling fd's  point when message is actually sent, which is 
> not often, and worse,  when upgrade status is changed, is unacceptable, 
> IMHO.

Agreed, but what is unacceptable is to force this onto a private and local 
interface, which is by definition unprotected.

> I'm just trying to be clear here, that swupdate should be robust 
> and stable regardless of buggy clients which could spawn many 
> re-connections.

Where are the buggy clients? If we are talking about clients connecting 
over the internet and trying to force a DoS, I fully agree. But if crappy 
clients are implemented and put on the device, this is not the duty of 
SWUpdate. If any program can be executed, then access to the socket must be 
protected.

> Having "Too many open files" error is not what I would 
> expect from an upgrade service in any scenario.

I can break any device if I can run what I want on it.

> 
> Regarding my implementation of the check, we can use getsockopt with 
> SOL_SOCKET if you prefer that better,

I guess this does not work - getsockopt just returns the last error, if 
one happened. That means it only works after doing something on the socket, 
that is, after reading (not available here, as the socket is only used for 
sending) or after writing (as is already done). If you call 
getsockopt(sock, SOL_SOCKET, SO_ERROR, ...) without any preceding 
operation, there is no error at all.

> and if you agree with my reasoning 
> - if I am making an error here, please let me know.

If SWUpdate is to monitor the client connections, a keep-alive mechanism 
is necessary. Linux's keepalive does not work for UDS, so this would have 
to be added in SWUpdate, and the thread would have to periodically send 
messages to the clients (messages that the clients should discard) to 
verify the connection.
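
A rough sketch of that idea, reusing the structure and queue names from the 
patch above (the probe byte and the helper name are illustrative, and the 
clients would have to be changed to discard the probe):

/* Illustrative only: probe the registered clients periodically and drop
 * connections whose peer is gone. Must be called with pprog->lock held. */
static void probe_progress_clients(struct swupdate_progress *pprog)
{
	struct progress_conn *conn, *tmp;
	const char probe = 0;

	SIMPLEQ_FOREACH_SAFE(conn, &pprog->conns, next, tmp) {
		/* MSG_NOSIGNAL: report EPIPE instead of raising SIGPIPE */
		if (send(conn->sockfd, &probe, 1, MSG_NOSIGNAL) < 0) {
			TRACE("Progress client gone, dropping connection");
			close(conn->sockfd);
			SIMPLEQ_REMOVE(&pprog->conns, conn, progress_conn, next);
			free(conn);
		}
	}
}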

Best regards,
Stefano Babic

Sava Jakovljev Feb. 19, 2021, 9:59 p.m. UTC | #7
Hi,

I have proved that *swupdate* can easily go into total chaos mode because 
it is not handling its resources appropriately - that is why I wrote that 
test script.
If you think it is acceptable for a program to be taken into a broken state 
*that easily*, then I will drop this issue.
The test script is not part of our system - it is a demonstration of how 
easily I can break the software. An upgrade service must not get into that 
state that easily - also, a program on an embedded Linux device must take 
special care of its resources.

Your statement that you can break any device if you try does not mean you 
should not try to prevent the obvious cases.
Upgrade software should simply work and be reliable - that means managing 
its own resources *at least*. It is not much to ask. Doing the check for 
dead connections only when actually sending a message is simply a bad 
design.

Best regards,
Sava Jakovljev

Stefano Babic Feb. 19, 2021, 10:13 p.m. UTC | #8
Hi Sava,

On 19.02.21 22:59, Sava Jakovljev wrote:
> Hi,
> 
> I have proved that *swupdate* can easily go into total chaos mode 
> because it is not handling its resources appropriately - that's why I 
> wrote that test script.

What is the difference between your test and another one where a client 
calls progress_ipc_connect() in a loop?

This is not solved by checking for dead connections - there are no dead 
connections at all.

> If you think that having a program which can be taken into a broken 
> state *that easily* is acceptable,

See my test above - it is just as easy, and your patch does not solve it.

> then I will drop this issue.
> The  test script it is not part of our system - it is a demonstration of 
> how easily I can break the software.

I am aware of it - but the IPC interfaces in SWUpdate are not meant to be 
foolproof. There is neither authentication nor any other protection.

What I suggest is to limit the number of client connections - it is 
currently unlimited. I cannot think of any reason why there should be so 
many clients. If SWUpdate reaches the limit of progress IPC clients, it 
will reject new connections.
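
A minimal sketch of that limit, reusing the structures from the patch (the 
constant and the helper are assumptions, not existing SWUpdate code); the 
accept loop would then close the freshly accepted connfd instead of 
queueing it when the helper returns true:

#define MAX_PROGRESS_CONNS 20	/* assumed limit, to be chosen by the integrator */

/* Illustrative only: call with pprog->lock held before queueing a new client. */
static bool progress_client_limit_reached(struct swupdate_progress *pprog)
{
	struct progress_conn *c;
	int nconns = 0;

	SIMPLEQ_FOREACH(c, &pprog->conns, next)
		nconns++;

	return nconns >= MAX_PROGRESS_CONNS;
}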

> Upgrade service must not get into 
> that state that easily - also, a program on embedded Linux device must 
> take special care about it's resources.

Sure, but let's put it this way: the progress IPC is not part of the 
upgrade itself, that is, SWUpdate does not (really) need it for itself.
But if this interface is used, it is the duty of the developers / 
integrators to make sure that the clients they write do not break the 
upgrade concept, because the clients *become* part of the updater.

> 
> Your statement that you can brake any device if you try doesn't mean you 
> should not try to prevent obvious cases.
> Upgrade software should simply work and be reliable - that means 
> managing its own resources *at least*. It's not much to ask. Doing the 
> check for dead connections

It does not work - see my example above, it is just another test case. 
So let's just limit the number of clients.

> only when actually sending a message is 
> simply a bad design.

It is not - this is the common way to check whether a connection is still 
established; the keep-alive mechanism does nothing different.

Best regards,
Stefano Babic

> 
> Best regards,
> Sava Jakovljev
> 
> 
> 
> Stefano Babic schrieb am Freitag, 19. Februar 2021 um 21:26:51 UTC+1:
> 
>     Hi Sava,
> 
>     On 19.02.21 17:59, Sava Jakovljev wrote:
>      > Hello,
>      >
>      > Just to provide a way that represents better what is the concern
>     and why
>      > this is a leak in my opinion, even though sockets are saved in
>     swupdate
>      > code:
>      > I ran a script which calls swupdate-progress in an infinite loop,
>     waits
>      > for a short period of time, then exits swupdate-progress and
>     continues the
>      > loop.
> 
>     It is not a mistery what it happens...
> 
>      > After a few minutes swupdate has more than 1024 open fd's.
> 
>     Nothing else expected.
> 
>      > This is clearly
>      > a bug - why shouldn't we check for dead connections when
>     accepting a new
>      > one?
> 
>     You have just proved that your application (the test script) is a mess
>     - please consider that the progress interface is a *LOCAL* interface
>     based on UDS, and it is not open to the world. It is the duty of the
>     integrator to build a suitable system, that is to estimate how many
>     progress clients should be started and to set rights on the sockets
>     accordingly to avoid local exploits.
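
As a concrete illustration of the socket-rights suggestion quoted above - a
minimal sketch, assuming the progress socket is a filesystem-path UDS (not an
abstract socket) and a hypothetical "swupdate-progress" group chosen by the
integrator:

/* Sketch only: after bind(), restrict who may connect to the local socket.
 * The group name and mode are examples, not SWUpdate defaults.
 */
#include <grp.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

static int restrict_progress_socket(const char *path)
{
	struct group *g = getgrnam("swupdate-progress");	/* hypothetical group */

	if (!g)
		return -1;
	if (chown(path, (uid_t)-1, g->gr_gid) < 0)
		return -1;
	return chmod(path, 0660);	/* owner and group may connect, others may not */
}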
> 
>      > As a consequence, command IPC is not working anymore
> 
>     Of course
> 
>      > - Linux forbids
>      > swupdate to open any new file
> 
>     according to limits and/or cgroups, and it is a good thing.
> 
>      > - it doesn't matter at all that progress
>      > IPC and command IPC are separated, they are part of the same
>     process -
>      > so I don't think this is out of context or that I am mixing any
>     terms.
>      > From Linux's point of view, a process is simply exceeding the maximum
>      > number
>      > of open files, so Linux won't allow it to open any more.
> 
>     Exactly. But it is not acceptable to generate any mess on the
>     device and then try to fix it somewhere else. I could also randomly write
>     into /dev/mem and then complain that something crashes.
> 
>      >
>      > Waiting for swupdate to send a message to progress IPC to clear
>     them up
>      > is not a good solution, in my opinion.
> 
>     The correct way to check if a connection is still alive is to use the
>     connection, that is sending / receiving or using the KEEP_ALIVE option.
>     But as this is a UDS, keep alive is not available and it cannot be
>     replaced by a message at the application level (in SWUpdate, then),
>     because the progress interface is thought to be unidirectional (no ACK
>     on the other side).
> 
>      > Given that swupdate is a background process which sits idle most
>     of the
>      > time and that progress IPC messages are only sent when there is an upgrade,
>      > clearing dangling fd's only at the point when a message is actually sent,
>     which is
>      > not often, and worse, when the upgrade status is changed, is
>     unacceptable,
>      > IMHO.
> 
>     Agree, but what is unacceptable is to force this onto a private and local
>     interface, that is by definition unprotected.
> 
>      > I'm just trying to be clear here, that swupdate should be robust
>      > and stable regardless of buggy clients which could spawn many
>      > re-connections.
> 
>     Where are the buggy clients? If we are talking about clients
>     connecting
>     over the internet and trying to force a DoS, I fully agree. But if crappy
>     clients are implemented and put on the device, this is not the duty of
>     SWUpdate. If any program can be executed, then access to the socket must be
>     protected.
> 
>      > Having a "Too many open files" error is not what I would
>      > expect from an upgrade service in any scenario.
> 
>     I can break any device if I can run what I want on it.
> 
>      >
>      > Regarding my implementation of the check, we can use getsockopt with
>      > SOL_SOCKET if you prefer that,
> 
>     I guess this does not work - getsockopt just returns the last error if
>     one happened. That means only after doing something on the socket, that is
>     by reading (not available here, as the socket is only used to send),
>     or by
>     writing (as it is already done). If you call getsockopt(sock, SOL_SOCKET,
>     SO_ERROR, ...) without any operation, there is no error at all.
> 
>      > and if you agree with my reasoning
>      > - if I am making an error here, please let me know.
> 
>     If SWUpdate should monitor the client connections, a keep-alive
>     mechanism is necessary. Linux's keepalive does not work for UDS, so
>     this
>     should be added in SWUpdate, and the thread should periodically send
>     messages to the client (messages that should be discarded by the
>     client)
>     to verify the connection.
> 
>     Best regards,
>     Stefano Babic
> 
>      >
>      > Best regards,
>      > Sava Jakovljev
>      > Sava Jakovljev schrieb am Freitag, 19. Februar 2021 um 14:34:31
>     UTC+1:
>      >
>      > Hi Stefano,
>      >
>      > Sure, if you want to keep it this way, no problem for me - as
>      > stated, it's mostly to have a bit better clean-up.
>      > I would argue the clean-up could be done better (to be precise, you
>      > are right and there is no leak, but one could get into a bad
>      > situation anyway, (not easy, granted) so technicality doesn't really
>      > matter).
>      > For our needs we will keep this kind of clean-up, taking your
>      > comments into account, but yeah, feel free to ignore it if you think
>      > that's the best.
>      >
>      > Thank you.
>      > Best regards,
>      > Sava Jakovljev
>      >
>      > Stefano Babic schrieb am Freitag, 19. Februar 2021 um 14:25:11
>     UTC+1:
>      >
>      > Hi Sava,
>      >
>      > On 19.02.21 14:04, Sava Jakovljev wrote:
>      > > Hi Stefano,
>      > >
>      > > It's not really necessary, but it deals with the situation
>      > when clients
>      > > connect and disconnect, and there are no progress messages in
>      > between.
>      >
>      > The progress IPC is local, there is no way to connect to it from
>      > outside the
>      > device. If a client connects and disconnects, it is simply a buggy
>      > application that must be fixed.
>      >
>      > > We encountered that situation in a test-scenario, not a
>      > real-world one,
>      > > but in my opinion, having this kind of resource leak should
>      > be handled
>      > > as soon as possible, and not just when we want to send a
>      > message.
>      >
>      > This is not a resource leak because SWUpdate does not really know
>      > if the
>      > connection is still on and using FIONREAD to detect this is not
>      > correct
>      > IMHO.
>      >
>      > > Also, it is not impossible for problem to occur in real-world
>      > - one just
>      > > has to try hard enough.
>      >
>      > It remains a buggy application - if we want to protect this, we
>      > should
>      > use what Linux already offers, for example setting access rights on
>      > the socket.
>      >
>      > > And yes, it will be solved on first progress
>      > > message sent, but then it might just be a bit too late.
>      >
>      > I do not think so.
>      >
>      > > For example, if
>      > > one wants to use control IPC,
>      >
>      > control IPC does not flow into the progress interface; it has its own
>      > connection,
>      > where you found the resource leak that you fix with the first patch.
>      >
>      > > having the response of "you have to wait
>      > > until we close our connections" is a no-go
>      >
>      > But again, this is caused by the resource leak in [1].
>      >
>      > > - and that will happen if
>      > > this problem occurs. I agree, it is unlikely - but it is
>      > possible.
>      >
>      > This is out of context - you are mixing the progress interface
>      > with the
>      > control interface. The progress interface is unidirectional, and
>      > data
>      > flows from SWUpdate to the registered client.
>      >
>      > > I don't see much need to periodically check for dead
>      > connections,
>      >
>      > I do not see it at all.
>      >
>      > > so
>      > > this is a middle-ground - if my reasoning is right, we are
>      > actually
>      > > preventing any resource leakage this way without too much
>      > overhead.
>      >
>      > I do not think there is a resource leak in the progress interface,
>      > while
>      > there clearly is one in the control interface.
>      >
>      > Best regards,
>      > Stefano Babic
>      >
>      > >
>      > > Thank you.
>      > > Best regards,
>      > > Sava Jakovljev
>      > > Stefano Babic schrieb am Freitag, 19. Februar 2021 um
>      > 08:56:34 UTC+1:
>      > >
>      > > Hi Sava,
>      > >
>      > > On 18.02.21 20:51, Sava Jakovljev wrote:
>      > > > ---
>      > > > core/progress_thread.c | 38
>      > ++++++++++++++++++++++++++++++++++++++
>      > > > 1 file changed, 38 insertions(+)
>      > > >
>      > > > diff --git a/core/progress_thread.c b/core/progress_thread.c
>      > > > index 1d4facd..8705bfd 100644
>      > > > --- a/core/progress_thread.c
>      > > > +++ b/core/progress_thread.c
>      > > > @@ -56,6 +56,43 @@ struct swupdate_progress {
>      > > > };
>      > > > static struct swupdate_progress progress;
>      > > >
>      > > > +static bool is_socket_closed(int sock)
>      > > > +{
>      > > > + fd_set fdset;
>      > > > +
>      > > > + FD_ZERO(&fdset);
>      > > > + FD_SET(sock, &fdset);
>      > > > +
>      > > > + struct timeval tv = {0};
>      > > > + select(sock + 1, &fdset, 0, 0, &tv);
>      > > > +
>      > > > + if (!FD_ISSET(sock, &fdset))
>      > > > + return false;
>      > > > +
>      > > > + int n = 0;
>      > > > + ioctl(sock, FIONREAD, &n);
>      > > > +
>      > >
>      > > Why the socket is closed if there are no data to be read ? It
>      > looks to
>      > > me unrelated.
>      > >
>      > > > + return n == 0;
>      > > > +}
>      > > > +
>      > > > +static void remove_dead_sockets(struct swupdate_progress*
>      > pprog)
>      > > > +{
>      > > > + struct progress_conn *conn, *tmp;
>      > > > +
>      > > > + SIMPLEQ_FOREACH_SAFE(conn, &pprog->conns, next, tmp)
>      > > > + {
>      > > > + if (is_socket_closed(conn->sockfd))
>      > > > + {
>      > > > + TRACE("A progress client disappeared, removing it.");
>      > > > + close(conn->sockfd);
>      > > > +
>      > > > + SIMPLEQ_REMOVE(&pprog->conns, conn,
>      > > > + progress_conn, next);
>      > > > + free(conn);
>      > > > + }
>      > > > + }
>      > > > +}
>      > > > +
>      > > > /*
>      > > > * This must be called after acquiring the mutex
>      > > > * for the progress structure
>      > > > @@ -282,6 +319,7 @@ void *progress_bar_thread (void
>      > __attribute__
>      > > ((__unused__)) *data)
>      > > > conn->sockfd = connfd;
>      > > > pthread_mutex_lock(&pprog->lock);
>      > > > SIMPLEQ_INSERT_TAIL(&pprog->conns, conn, next);
>      > > > + remove_dead_sockets(pprog);
>      > >
>      > > Why do we need this housekeeping here ?
>      > >
>      > > The connection are verified any time a message is sent in
>      > > send_progress_msg(). If socket is closed, write() returns
>      > with error
>      > > and
>      > > the connection is removed from the list. This seems to me
>      > arbitrary.
>      > >
>      > > Best regards,
>      > > Stefano Babic
>      > >
>      > >
>      > >
>      > >
>      > > > pthread_mutex_unlock(&pprog->lock);
>      > > > } while(1);
>      > > > }
>      > > >
diff mbox series

Patch

diff --git a/core/progress_thread.c b/core/progress_thread.c
index 1d4facd..8705bfd 100644
--- a/core/progress_thread.c
+++ b/core/progress_thread.c
@@ -56,6 +56,43 @@  struct swupdate_progress {
 };
 static struct swupdate_progress progress;
 
+static bool is_socket_closed(int sock)
+{
+	fd_set fdset;
+
+	FD_ZERO(&fdset);
+	FD_SET(sock, &fdset);
+
+	struct timeval tv = {0};
+	select(sock + 1, &fdset, 0, 0, &tv);
+
+	if (!FD_ISSET(sock, &fdset))
+		return false;
+
+	int n = 0;
+	ioctl(sock, FIONREAD, &n);
+
+	return n == 0;
+}
+
+static void remove_dead_sockets(struct swupdate_progress* pprog)
+{
+	struct progress_conn *conn, *tmp;
+
+	SIMPLEQ_FOREACH_SAFE(conn, &pprog->conns, next, tmp)
+	{
+		if (is_socket_closed(conn->sockfd))
+		{
+			TRACE("A progress client disappeared, removing it.");
+			close(conn->sockfd);
+
+			SIMPLEQ_REMOVE(&pprog->conns, conn,
+							progress_conn, next);
+			free(conn);
+		}
+	}
+}
+
 /*
  * This must be called after acquiring the mutex
  * for the progress structure
@@ -282,6 +319,7 @@  void *progress_bar_thread (void __attribute__ ((__unused__)) *data)
 		conn->sockfd = connfd;
 		pthread_mutex_lock(&pprog->lock);
 		SIMPLEQ_INSERT_TAIL(&pprog->conns, conn, next);
+		remove_dead_sockets(pprog);
 		pthread_mutex_unlock(&pprog->lock);
 	} while(1);
 }
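
For comparison with the FIONREAD-based check above, a hedged alternative
sketch (not part of the submitted patch): recv() with MSG_PEEK | MSG_DONTWAIT
distinguishes an idle peer from one that has closed its end, which avoids the
"readable but zero bytes pending" ambiguity raised in the review.

/* Sketch only: on a stream socket recv() returns 0 only after the peer has
 * performed an orderly shutdown; EAGAIN/EWOULDBLOCK just means no data is
 * pending. MSG_PEEK leaves any pending data in place.
 */
#include <errno.h>
#include <stdbool.h>
#include <sys/socket.h>
#include <sys/types.h>

static bool is_socket_closed(int sock)
{
	char c;
	ssize_t n = recv(sock, &c, 1, MSG_PEEK | MSG_DONTWAIT);

	if (n > 0)
		return false;		/* data pending, peer alive */
	if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
		return false;		/* no data, connection still open */
	return true;			/* 0: orderly shutdown; otherwise treat as dead */
}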