diff mbox series

[v2] lib: shell: Fix timeout process races

Message ID 20210920122146.31576-1-chrubis@suse.cz
State Changes Requested
Headers show
Series [v2] lib: shell: Fix timeout process races | expand

Commit Message

Cyril Hrubis Sept. 20, 2021, 12:21 p.m. UTC
There were actually several races in the shell library timeout handling.

This commit fixes hopefully all of them by:

* Reimplementing the backgroud timer in C

* Making sure that the timer has started before we leave test setup

The rewrite of the backround timer to C allows us to run all the timeout
logic in a single process, which simplifies the whole problem greatly
since previously we had chain of processes that estabilished signal
handlers to kill it's descendants, which in the end had a few races in
it.

The race that caused the problems is, as far as I can tell, in the way
how shell spawns it's children. I haven't checked the shell code, but I
guess that when shell runs a process in bacground it does vfork()
+ exec() and because signals are ignored during the operation. My guess
that signals are disabled between vfork() and exec(). If the SIGTERM
arrives at that point it gets lost.

That means that we created a race window in the shell library each time
we started a new process. The rewrite to C simplifies the code but we
still end up with a single place where this can happen and that is when
we execute the tst_timeout_kill binary. This is now fixed in the shell
library by waiting until the background process gets to a sleep state.

Reported-by: Petr Vorel <pvorel@suse.cz>
Signed-off-by: Cyril Hrubis <chrubis@suse.cz>
---
 testcases/lib/.gitignore         |  1 +
 testcases/lib/Makefile           |  2 +-
 testcases/lib/tst_test.sh        | 47 ++++++--------------
 testcases/lib/tst_timeout_kill.c | 75 ++++++++++++++++++++++++++++++++
 4 files changed, 90 insertions(+), 35 deletions(-)
 create mode 100644 testcases/lib/tst_timeout_kill.c

Changes in v2:

- simplify the message printing
- fix .gitignore entry

Comments

Petr Vorel Sept. 20, 2021, 4:11 p.m. UTC | #1
Hi Cyril,

> Changes in v2:

> - simplify the message printing
> - fix .gitignore entry

changes LGTM.

Reviewed-by: Petr Vorel <pvorel@suse.cz>
Tested-by: Petr Vorel <pvorel@suse.cz>

Kind regards,
Petr
Li Wang Sept. 21, 2021, 3:27 a.m. UTC | #2
Hi Cyril,



> +++ b/testcases/lib/tst_timeout_kill.c
> @@ -0,0 +1,75 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later
> +/*
> + * Copyright (c) 2021 Cyril Hrubis <chrubis@suse.cz>
> + */
> +
> +#include <stdio.h>
> +#include <signal.h>
> +#include <stdlib.h>
> +#include <unistd.h>
> +#include <errno.h>
> +#include <string.h>
> +
> +static void print_help(const char *name)
> +{
> +       fprintf(stderr, "usage: %s timeout pid\n", name);
> +}
> +
> +#define print_msg(fmt, ...) fprintf(stderr, fmt "\n", ##__VA_ARGS__)
> +
> +int main(int argc, char *argv[])
> +{
> +       int timeout, pid, ret, i;
> +
> +       if (argc != 3) {
> +               print_help(argv[0]);
> +               return 1;
> +       }
> +
> +       timeout = atoi(argv[1]);
> +       pid = atoi(argv[2]);
> +
> +       if (timeout < 0) {
> +               fprintf(stderr, "Invalid timeout '%s'\n", argv[1]);
> +               print_help(argv[0]);
> +               return 1;
> +       }
> +
> +       if (pid <= 1) {
> +               fprintf(stderr, "Invalid pid '%s'\n", argv[2]);
> +               print_help(argv[0]);
> +               return 1;
> +       }
> +
> +       if (timeout)
> +               sleep(timeout);
> +
> +       print_msg("Test timed out, sending SIGTERM!");
> +       print_msg("If you are running on slow machine, try exporting
> LTP_TIMEOUT_MUL > 1");
> +
> +       ret = kill(-pid, SIGTERM);
>

This 'pid' is the parent shell process id, so it obviously that
tst_timeout_kill
process would get signal SIGTERM as well.

I'm thinking maybe we should let tst_timeout_kill itself ignore SIGTERM
otherwise we have no chance to perform the following double-check code?



> +       if (ret) {
> +               print_msg("kill(%i) failed: %s", -pid, strerror(errno));
> +               return 1;
> +       }
> +
> +       usleep(100000);
> +
> +       i = 10;
> +
> +       while (!kill(-pid, 0) && i-- > 0) {
>

And just check 'pid' but not '-pid' as original shell behavior?


> +               print_msg("Test is still running...");
> +               sleep(1);
> +       }
> +
> +       if (!kill(-pid, 0)) {
>

check pid, if so like that.


> +               print_msg("Test is still running, sending SIGKILL");
> +               ret = kill(-pid, SIGKILL);
> +               if (ret) {
> +                       print_msg("kill(%i) failed: %s", -pid,
> strerror(errno));
> +                       return 1;
> +               }
> +       }
> +
> +       return 0;
> +}
> --
> 2.32.0
>
>
> --
> Mailing list info: https://lists.linux.it/listinfo/ltp
>
>
Cyril Hrubis Sept. 21, 2021, 11:30 a.m. UTC | #3
Hi!
> This 'pid' is the parent shell process id, so it obviously that
> tst_timeout_kill
> process would get signal SIGTERM as well.
> 
> I'm thinking maybe we should let tst_timeout_kill itself ignore SIGTERM
> otherwise we have no chance to perform the following double-check code?

I guess that signal(SIGTERM, SIG_IGN) a the start of the main() should
fix it.
Li Wang Sept. 22, 2021, 4:30 a.m. UTC | #4
On Tue, Sep 21, 2021 at 7:30 PM Cyril Hrubis <chrubis@suse.cz> wrote:

> Hi!
> > This 'pid' is the parent shell process id, so it obviously that
> > tst_timeout_kill
> > process would get signal SIGTERM as well.
> >
> > I'm thinking maybe we should let tst_timeout_kill itself ignore SIGTERM
> > otherwise we have no chance to perform the following double-check code?
>
> I guess that signal(SIGTERM, SIG_IGN) a the start of the main() should
> fix it.
>

It works, but better put it behind of sleep(timeout).

Because we still need to guarantee tst_timeout_kill can be
stopped by _tst_cleanup_timer before timeout happening.


--- a/testcases/lib/tst_timeout_kill.c
+++ b/testcases/lib/tst_timeout_kill.c
@@ -44,6 +44,8 @@ int main(int argc, char *argv[])
        if (timeout)
                sleep(timeout);

+       signal(SIGTERM, SIG_IGN);
+
        print_msg("Test timed out, sending SIGTERM!");
        print_msg("If you are running on slow machine, try exporting
LTP_TIMEOUT_MUL > 1");

@@ -57,12 +59,12 @@ int main(int argc, char *argv[])

        i = 10;

-       while (!kill(-pid, 0) && i-- > 0) {
+       while (!kill(pid, 0) && i-- > 0) {
                print_msg("Test is still running...");
                sleep(1);
        }

-       if (!kill(-pid, 0)) {
+       if (!kill(pid, 0)) {
                print_msg("Test is still running, sending SIGKILL");
                ret = kill(-pid, SIGKILL);
                if (ret) {
Cyril Hrubis Sept. 22, 2021, 9:02 a.m. UTC | #5
Hi!
> > > This 'pid' is the parent shell process id, so it obviously that
> > > tst_timeout_kill
> > > process would get signal SIGTERM as well.
> > >
> > > I'm thinking maybe we should let tst_timeout_kill itself ignore SIGTERM
> > > otherwise we have no chance to perform the following double-check code?
> >
> > I guess that signal(SIGTERM, SIG_IGN) a the start of the main() should
> > fix it.
> >
> 
> It works, but better put it behind of sleep(timeout).
> 
> Because we still need to guarantee tst_timeout_kill can be
> stopped by _tst_cleanup_timer before timeout happening.

Right, of course we want the timeout killer to be killable before the
timeout.

> --- a/testcases/lib/tst_timeout_kill.c
> +++ b/testcases/lib/tst_timeout_kill.c
> @@ -44,6 +44,8 @@ int main(int argc, char *argv[])
>         if (timeout)
>                 sleep(timeout);
> 
> +       signal(SIGTERM, SIG_IGN);
> +
>         print_msg("Test timed out, sending SIGTERM!");
>         print_msg("If you are running on slow machine, try exporting
> LTP_TIMEOUT_MUL > 1");
> 
> @@ -57,12 +59,12 @@ int main(int argc, char *argv[])
> 
>         i = 10;
> 
> -       while (!kill(-pid, 0) && i-- > 0) {
> +       while (!kill(pid, 0) && i-- > 0) {
>                 print_msg("Test is still running...");
>                 sleep(1);
>         }
> 
> -       if (!kill(-pid, 0)) {
> +       if (!kill(pid, 0)) {
>                 print_msg("Test is still running, sending SIGKILL");
>                 ret = kill(-pid, SIGKILL);
>                 if (ret) {

Hmm, and of course this does not work since the timeout kill process is
around. Maybe we should make this process a separate process group from
the start, what about calling setpgid(0, 0) instead of setting up the
signal handler? That way we can send the signals to the whole process
group and make sure everything has been cleaned up.
Li Wang Sept. 22, 2021, 9:13 a.m. UTC | #6
> Hmm, and of course this does not work since the timeout kill process is
> around. Maybe we should make this process a separate process group from
> the start, what about calling setpgid(0, 0) instead of setting up the
> signal handler? That way we can send the signals to the whole process
> group and make sure everything has been cleaned up.
>

Agreed, that sounds like a great method.
Cyril Hrubis Sept. 22, 2021, 9:21 a.m. UTC | #7
Hi!
> Agreed, that sounds like a great method.

Should I send v3 or can I commit v2 with this change?
Li Wang Sept. 22, 2021, 9:22 a.m. UTC | #8
On Wed, Sep 22, 2021 at 5:20 PM Cyril Hrubis <chrubis@suse.cz> wrote:

> Hi!
> > Agreed, that sounds like a great method.
>
> Should I send v3 or can I commit v2 with this change?
>

V3 please, I will help test it all.

I just think of setsid(), is that works and better than setpgid(0, 0)?
Cyril Hrubis Sept. 22, 2021, 9:31 a.m. UTC | #9
Hi!
> > > Agreed, that sounds like a great method.
> >
> > Should I send v3 or can I commit v2 with this change?
> >
> 
> V3 please, I will help test it all.

Ok, will send ASAP.

> I just think of setsid(), is that works and better than setpgid(0, 0)?

For moving the process out of the process group setpgid() is enough, we
use it in the lib/tst_test.c for the fork_testrun() in order to track
all children of the test process as well.

I do not think that we need a new session here, that would probably have
strange side effects, for instance it would isolate the tst_timeout_kill
process from the terminal the tests were started in.
Li Wang Sept. 22, 2021, 9:39 a.m. UTC | #10
On Wed, Sep 22, 2021 at 5:31 PM Cyril Hrubis <chrubis@suse.cz> wrote:

> Hi!
> > > > Agreed, that sounds like a great method.
> > >
> > > Should I send v3 or can I commit v2 with this change?
> > >
> >
> > V3 please, I will help test it all.
>
> Ok, will send ASAP.
>
Thanks.


>
> > I just think of setsid(), is that works and better than setpgid(0, 0)?
>
> For moving the process out of the process group setpgid() is enough, we
> use it in the lib/tst_test.c for the fork_testrun() in order to track
> all children of the test process as well.
>
> I do not think that we need a new session here, that would probably have
> strange side effects, for instance it would isolate the tst_timeout_kill
> process from the terminal the tests were started in.
>

Ok, thanks! Let's go with the safe method setpgid(0, 0).
diff mbox series

Patch

diff --git a/testcases/lib/.gitignore b/testcases/lib/.gitignore
index 5a0e8cba2..9625d9043 100644
--- a/testcases/lib/.gitignore
+++ b/testcases/lib/.gitignore
@@ -14,3 +14,4 @@ 
 /tst_sleep
 /tst_supported_fs
 /tst_hexdump
+/tst_timeout_kill
diff --git a/testcases/lib/Makefile b/testcases/lib/Makefile
index 179b47479..d6b4c7a91 100644
--- a/testcases/lib/Makefile
+++ b/testcases/lib/Makefile
@@ -11,6 +11,6 @@  INSTALL_TARGETS		:= *.sh
 MAKE_TARGETS		:= tst_sleep tst_random tst_checkpoint tst_rod tst_kvcmp\
 			   tst_device tst_net_iface_prefix tst_net_ip_prefix tst_net_vars\
 			   tst_getconf tst_supported_fs tst_check_drivers tst_get_unused_port\
-			   tst_get_median tst_hexdump tst_get_free_pids
+			   tst_get_median tst_hexdump tst_get_free_pids tst_timeout_kill
 
 include $(top_srcdir)/include/mk/generic_leaf_target.mk
diff --git a/testcases/lib/tst_test.sh b/testcases/lib/tst_test.sh
index 8f69b0551..2556b28f5 100644
--- a/testcases/lib/tst_test.sh
+++ b/testcases/lib/tst_test.sh
@@ -474,27 +474,6 @@  _tst_multiply_timeout()
 	return 0
 }
 
-_tst_kill_test()
-{
-	local i=10
-
-	trap '' TERM
-	tst_res TBROK "Test timed out, sending SIGTERM! If you are running on slow machine, try exporting LTP_TIMEOUT_MUL > 1"
-	kill -TERM -$pid
-	tst_sleep 100ms
-
-	while kill -0 $pid >/dev/null 2>&1 && [ $i -gt 0 ]; do
-		tst_res TINFO "Test is still running, waiting ${i}s"
-		sleep 1
-		i=$((i-1))
-	done
-
-	if kill -0 $pid >/dev/null 2>&1; then
-		tst_res TBROK "Test still running, sending SIGKILL"
-		kill -KILL -$pid
-	fi
-}
-
 _tst_cleanup_timer()
 {
 	if [ -n "$_tst_setup_timer_pid" ]; then
@@ -503,18 +482,6 @@  _tst_cleanup_timer()
 	fi
 }
 
-_tst_timeout_process()
-{
-	local sleep_pid
-
-	sleep $sec &
-	sleep_pid=$!
-	trap "kill $sleep_pid; exit" TERM
-	wait $sleep_pid
-	trap - TERM
-	_tst_kill_test
-}
-
 _tst_setup_timer()
 {
 	TST_TIMEOUT=${TST_TIMEOUT:-300}
@@ -539,9 +506,21 @@  _tst_setup_timer()
 
 	_tst_cleanup_timer
 
-	_tst_timeout_process &
+	tst_timeout_kill $sec $pid &
 
 	_tst_setup_timer_pid=$!
+
+	while true; do
+		local state
+
+		state=$(cut -d' ' -f3 "/proc/$_tst_setup_timer_pid/stat")
+
+		if [ "$state" = "S" ]; then
+			break;
+		fi
+
+		tst_sleep 1ms
+	done
 }
 
 tst_require_root()
diff --git a/testcases/lib/tst_timeout_kill.c b/testcases/lib/tst_timeout_kill.c
new file mode 100644
index 000000000..d55367d31
--- /dev/null
+++ b/testcases/lib/tst_timeout_kill.c
@@ -0,0 +1,75 @@ 
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (c) 2021 Cyril Hrubis <chrubis@suse.cz>
+ */
+
+#include <stdio.h>
+#include <signal.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <errno.h>
+#include <string.h>
+
+static void print_help(const char *name)
+{
+	fprintf(stderr, "usage: %s timeout pid\n", name);
+}
+
+#define print_msg(fmt, ...) fprintf(stderr, fmt "\n", ##__VA_ARGS__)
+
+int main(int argc, char *argv[])
+{
+	int timeout, pid, ret, i;
+
+	if (argc != 3) {
+		print_help(argv[0]);
+		return 1;
+	}
+
+	timeout = atoi(argv[1]);
+	pid = atoi(argv[2]);
+
+	if (timeout < 0) {
+		fprintf(stderr, "Invalid timeout '%s'\n", argv[1]);
+		print_help(argv[0]);
+		return 1;
+	}
+
+	if (pid <= 1) {
+		fprintf(stderr, "Invalid pid '%s'\n", argv[2]);
+		print_help(argv[0]);
+		return 1;
+	}
+
+	if (timeout)
+		sleep(timeout);
+
+	print_msg("Test timed out, sending SIGTERM!");
+	print_msg("If you are running on slow machine, try exporting LTP_TIMEOUT_MUL > 1");
+
+	ret = kill(-pid, SIGTERM);
+	if (ret) {
+		print_msg("kill(%i) failed: %s", -pid, strerror(errno));
+		return 1;
+	}
+
+	usleep(100000);
+
+	i = 10;
+
+	while (!kill(-pid, 0) && i-- > 0) {
+		print_msg("Test is still running...");
+		sleep(1);
+	}
+
+	if (!kill(-pid, 0)) {
+		print_msg("Test is still running, sending SIGKILL");
+		ret = kill(-pid, SIGKILL);
+		if (ret) {
+			print_msg("kill(%i) failed: %s", -pid, strerror(errno));
+			return 1;
+		}
+	}
+
+	return 0;
+}