diff mbox series

[v3,3/4] lib: ignore SIGINT in _tst_kill_test

Message ID 20210508055109.16914-4-liwang@redhat.com
State Accepted
Headers show
Series shell: Fix orphan timeout sleep processes | expand

Commit Message

Li Wang May 8, 2021, 5:51 a.m. UTC
We have to guarantee _tst_kill_test alive for a while to check if
the target test eixst or not, so ignore SIGINT in _tst_kill_test
is necessary, otherwise it will be stopped by the SIGINT sending
by itself.

The timeout03.sh verify this mechanism proccess well in output:

  timeout03 1 TBROK: Test timeouted, sending SIGINT! If you are running on slow machine, try exporting LTP_TIMEOUT_MUL > 1
  timeout03 1 TBROK: test interrupted or timed out
  timeout03 1 TPASS: test run cleanup after timeout
  timeout03 1 TINFO: Test is still running, waiting 10s
  timeout03 1 TINFO: Test is still running, waiting 9s
  timeout03 1 TINFO: Test is still running, waiting 8s
  timeout03 1 TINFO: Test is still running, waiting 7s
  timeout03 1 TINFO: Test is still running, waiting 6s
  timeout03 1 TINFO: Test is still running, waiting 5s
  timeout03 1 TINFO: Test is still running, waiting 4s
  timeout03 1 TINFO: Test is still running, waiting 3s
  timeout03 1 TINFO: Test is still running, waiting 2s
  timeout03 1 TINFO: Test is still running, waiting 1s
  timeout03 1 TBROK: Test still running, sending SIGKILL
  Killed

(This one based on Joerg's improvement work, better apply this after the two patches)

Fixes: 25ad54dbaea6("tst_test.sh: Run cleanup also after test timeout")
Signed-off-by: Li Wang <liwang@redhat.com>
Cc: Joerg Vehlow <lkml@jv-coder.de>
Acked-by: Petr Vorel <pvorel@suse.cz>
---
 testcases/lib/tst_test.sh | 1 +
 1 file changed, 1 insertion(+)

Comments

Cyril Hrubis May 10, 2021, 11:47 a.m. UTC | #1
Hi!
Reviewed-by: Cyril Hrubis <chrubis@suse.cz>
Joerg Vehlow May 11, 2021, 5:54 a.m. UTC | #2
Hi Li,

first of all thanks for fixing my patchset and getting it merged.

On 5/8/2021 7:51 AM, Li Wang wrote:
> We have to guarantee _tst_kill_test alive for a while to check if
> the target test eixst or not, so ignore SIGINT in _tst_kill_test
> is necessary, otherwise it will be stopped by the SIGINT sending
> by itself.
>
> The timeout03.sh verify this mechanism proccess well in output:
>
>    timeout03 1 TBROK: Test timeouted, sending SIGINT! If you are running on slow machine, try exporting LTP_TIMEOUT_MUL > 1
>    timeout03 1 TBROK: test interrupted or timed out
>    timeout03 1 TPASS: test run cleanup after timeout
>    timeout03 1 TINFO: Test is still running, waiting 10s
>    timeout03 1 TINFO: Test is still running, waiting 9s
>    timeout03 1 TINFO: Test is still running, waiting 8s
>    timeout03 1 TINFO: Test is still running, waiting 7s
>    timeout03 1 TINFO: Test is still running, waiting 6s
>    timeout03 1 TINFO: Test is still running, waiting 5s
>    timeout03 1 TINFO: Test is still running, waiting 4s
>    timeout03 1 TINFO: Test is still running, waiting 3s
>    timeout03 1 TINFO: Test is still running, waiting 2s
>    timeout03 1 TINFO: Test is still running, waiting 1s
>    timeout03 1 TBROK: Test still running, sending SIGKILL
>    Killed
At first I did bot understand the problem you found, because I tried 
with dash, busybox sh and zsh.
All three shells had no problem here. But then I tried with bash and it 
failed.

I wonder if this is a bug in bash or in the other shells. I guess 
sending the signal to the whole
process group should also send it to the process running _tst_kill_test.

I did some digging into this while writing this (see conclusion below 
for results only):
1. All shells have their own implementation of kill (compare <SHELL> -c 
kill with /usr/bin kill)
2. When replacing "just" kill in the script with /usr/bin/kill, it still 
only fails in bash.
3. zsh seems to ignore SIGINT, but it can be caught using trap. busybox 
sh, and dash can't even get it when trapped
4. zsh disables SIGINT by callling "trap '' INT" internally somehow. 
When resetting SIGINT to default behavior, it is the same as bash.

For zsh this seems to be default behavior for background processes, 
probably to prevent keyboard interruption by CTRL+C:
zsh -c "trap&"
trap -- '' INT
trap -- '' QUIT

zsh -c "trap"
# No output



To conclude:
- bash does not seem to care about SIGINT delivery to background 
processes, but can be blocked using trap
- zsh ignores SIGINT for background processes by default, but can be 
allowed using trap
- dash and busybox sh ignore the signal to background processes, and 
this cannot be changed with trap

I tried with the following snippets:
<SHELL> -c 'trap "echo trap;" INT; (sleep 2; echo end sub) & sleep 1; 
kill -INT -$$; echo end main'
<SHELL> -c 'trap "echo trap;" INT; (trap - SIGINT sleep 2; echo end sub) 
& sleep 1; kill -INT -$$; echo end main'
<SHELL> -c 'trap "echo trap;" INT; (trap "exit" SIGINT sleep 2; echo end 
sub) & sleep 1; kill -INT -$$; echo end main'

SIGINT handling for child processes is strange. This might have some 
implication for the shell tests,
because it is possible, that SIGINT is not delivered to all processes 
and some may reside as orphans.
Since this can happen only in case of timeouts, I guess there is no real 
Problem.

A possible fix could be using SIGTERM instead of SIGINT. This signal 
does not seem to have some "intelligent" handling for background processes.
I do not know why LTP used SIGINT in the first place. My first thought 
would have been to use SIGTERM.  It is the way to "politely ask 
processes to terminate"

Jörg
Li Wang May 12, 2021, 9:49 a.m. UTC | #3
Hi Joerg,

On Tue, May 11, 2021 at 1:52 PM Joerg Vehlow <lkml@jv-coder.de> wrote:
>
> Hi Li,
>
> first of all thanks for fixing my patchset and getting it merged.
>
> On 5/8/2021 7:51 AM, Li Wang wrote:
> > We have to guarantee _tst_kill_test alive for a while to check if
> > the target test eixst or not, so ignore SIGINT in _tst_kill_test
> > is necessary, otherwise it will be stopped by the SIGINT sending
> > by itself.
> >
> > The timeout03.sh verify this mechanism proccess well in output:
> >
> >    timeout03 1 TBROK: Test timeouted, sending SIGINT! If you are running on slow machine, try exporting LTP_TIMEOUT_MUL > 1
> >    timeout03 1 TBROK: test interrupted or timed out
> >    timeout03 1 TPASS: test run cleanup after timeout
> >    timeout03 1 TINFO: Test is still running, waiting 10s
> >    timeout03 1 TINFO: Test is still running, waiting 9s
> >    timeout03 1 TINFO: Test is still running, waiting 8s
> >    timeout03 1 TINFO: Test is still running, waiting 7s
> >    timeout03 1 TINFO: Test is still running, waiting 6s
> >    timeout03 1 TINFO: Test is still running, waiting 5s
> >    timeout03 1 TINFO: Test is still running, waiting 4s
> >    timeout03 1 TINFO: Test is still running, waiting 3s
> >    timeout03 1 TINFO: Test is still running, waiting 2s
> >    timeout03 1 TINFO: Test is still running, waiting 1s
> >    timeout03 1 TBROK: Test still running, sending SIGKILL
> >    Killed
> At first I did bot understand the problem you found, because I tried
> with dash, busybox sh and zsh.
> All three shells had no problem here. But then I tried with bash and it
> failed.
>
> I wonder if this is a bug in bash or in the other shells. I guess
> sending the signal to the whole
> process group should also send it to the process running _tst_kill_test.
>
> I did some digging into this while writing this (see conclusion below
> for results only):
> 1. All shells have their own implementation of kill (compare <SHELL> -c
> kill with /usr/bin kill)
> 2. When replacing "just" kill in the script with /usr/bin/kill, it still
> only fails in bash.
> 3. zsh seems to ignore SIGINT, but it can be caught using trap. busybox
> sh, and dash can't even get it when trapped
> 4. zsh disables SIGINT by callling "trap '' INT" internally somehow.
> When resetting SIGINT to default behavior, it is the same as bash.
>
> For zsh this seems to be default behavior for background processes,
> probably to prevent keyboard interruption by CTRL+C:
> zsh -c "trap&"
> trap -- '' INT
> trap -- '' QUIT
>
> zsh -c "trap"
> # No output
>
>
>
> To conclude:
> - bash does not seem to care about SIGINT delivery to background
> processes, but can be blocked using trap
> - zsh ignores SIGINT for background processes by default, but can be
> allowed using trap
> - dash and busybox sh ignore the signal to background processes, and
> this cannot be changed with trap
>
> I tried with the following snippets:
> <SHELL> -c 'trap "echo trap;" INT; (sleep 2; echo end sub) & sleep 1;
> kill -INT -$$; echo end main'
> <SHELL> -c 'trap "echo trap;" INT; (trap - SIGINT sleep 2; echo end sub)
> & sleep 1; kill -INT -$$; echo end main'
> <SHELL> -c 'trap "echo trap;" INT; (trap "exit" SIGINT sleep 2; echo end
> sub) & sleep 1; kill -INT -$$; echo end main'
>

Thanks for the demos above, it shows the difference clearly.

> SIGINT handling for child processes is strange. This might have some
> implication for the shell tests,
> because it is possible, that SIGINT is not delivered to all processes
> and some may reside as orphans.
> Since this can happen only in case of timeouts, I guess there is no real
> Problem.

Yes.

Looks like the behaviors on signal 'SIGINT' are not unify in background
processes handling for different SHELL. So as you said that using SIGINT
seems NOT a good idea to stop the process in timeout.

>
> A possible fix could be using SIGTERM instead of SIGINT. This signal
> does not seem to have some "intelligent" handling for background processes.

I agree. Can you make a patch to replace that INT?

(and this is only a timeout issue, so patch merging may be delayed due
to LTP new release)

> I do not know why LTP used SIGINT in the first place. My first thought
> would have been to use SIGTERM.  It is the way to "politely ask
> processes to terminate"

Yes, but that not strange to me, the possible reason is just to
stop(ctrl ^c) the LTP test manually for debugging, so we went
too far for using SIGINT but forget the original purpose :).
Joerg Vehlow May 12, 2021, 10:34 a.m. UTC | #4
Hi Li,
>> A possible fix could be using SIGTERM instead of SIGINT. This signal
>> does not seem to have some "intelligent" handling for background processes.
> I agree. Can you make a patch to replace that INT?
>
> (and this is only a timeout issue, so patch merging may be delayed due
> to LTP new release)
I'd like to supply the patch, I've placed it on my todo list. I will 
probably not finish it before the release,
but since it will probably not be included anyway, it doesn't matter.
I do not know why LTP used SIGINT in the first place. My first thought
>> would have been to use SIGTERM.  It is the way to "politely ask
>> processes to terminate"
> Yes, but that not strange to me, the possible reason is just to
> stop(ctrl ^c) the LTP test manually for debugging, so we went
> too far for using SIGINT but forget the original purpose :).
Ok sounds reasonable. The nice think about changing timeout to SIGTERM 
would be,
that abort using CTRL+C is clearly distinguishable from a timeout.

Jörg
Petr Vorel May 12, 2021, 2:20 p.m. UTC | #5
Hi all,

...
> > To conclude:
> > - bash does not seem to care about SIGINT delivery to background
> > processes, but can be blocked using trap
> > - zsh ignores SIGINT for background processes by default, but can be
> > allowed using trap
> > - dash and busybox sh ignore the signal to background processes, and
> > this cannot be changed with trap

> > I tried with the following snippets:
> > <SHELL> -c 'trap "echo trap;" INT; (sleep 2; echo end sub) & sleep 1;
> > kill -INT -$$; echo end main'
> > <SHELL> -c 'trap "echo trap;" INT; (trap - SIGINT sleep 2; echo end sub)
> > & sleep 1; kill -INT -$$; echo end main'
> > <SHELL> -c 'trap "echo trap;" INT; (trap "exit" SIGINT sleep 2; echo end
> > sub) & sleep 1; kill -INT -$$; echo end main'
FYI (you probably know it) SIGINT is a bashism, INT needs to be used.
$ kill -s SIGINT $$
dash: 2: kill: invalid signal number or name: SIGINT

> Thanks for the demos above, it shows the difference clearly.

> > SIGINT handling for child processes is strange. This might have some
> > implication for the shell tests,
> > because it is possible, that SIGINT is not delivered to all processes
> > and some may reside as orphans.
> > Since this can happen only in case of timeouts, I guess there is no real
> > Problem.

> Yes.

> Looks like the behaviors on signal 'SIGINT' are not unify in background
> processes handling for different SHELL. So as you said that using SIGINT
> seems NOT a good idea to stop the process in timeout.

Yes, trap looks to be having some portability issues [1] [2]

> > A possible fix could be using SIGTERM instead of SIGINT. This signal
> > does not seem to have some "intelligent" handling for background processes.

> I agree. Can you make a patch to replace that INT?

> (and this is only a timeout issue, so patch merging may be delayed due
> to LTP new release)

> > I do not know why LTP used SIGINT in the first place. My first thought
> > would have been to use SIGTERM.  It is the way to "politely ask
> > processes to terminate"

> Yes, but that not strange to me, the possible reason is just to
> stop(ctrl ^c) the LTP test manually for debugging, so we went
> too far for using SIGINT but forget the original purpose :).
I'm not the author, but yes, SIGINT was used for stopping with ctrl^c during
debugging.

FYI I tried to use both SIGINT and SIGTERM, but there was some problem.
But I guess it was my error. Please *add* SIGTERM (keep SIGINT).

Kind regards,
Petr

[1] https://unix.stackexchange.com/questions/240723/exit-trap-in-dash-vs-ksh-and-bash/240736#240736
[2] https://stackoverflow.com/questions/27012762/is-trap-exit-required-to-execute-in-case-of-sigint-or-sigterm-received
Li Wang May 13, 2021, 5:08 a.m. UTC | #6
> FYI I tried to use both SIGINT and SIGTERM, but there was some problem.
> But I guess it was my error. Please *add* SIGTERM (keep SIGINT).

Yes, we'd better keep SIGINT for ctrl^c action and use SIGTERM
additionally for process terminating.

Does this below (rough solution in my mind) work for you?

diff --git a/lib/newlib_tests/shell/timeout03.sh
b/lib/newlib_tests/shell/timeout03.sh
index cd548d9a2..f39f5712a 100755
--- a/lib/newlib_tests/shell/timeout03.sh
+++ b/lib/newlib_tests/shell/timeout03.sh
@@ -30,6 +30,7 @@ TST_TIMEOUT=1

 do_test()
 {
+       trap "echo 'Sorry, timeout03 is still alive'" TERM
        tst_res TINFO "testing killing test after TST_TIMEOUT"

        sleep 2
diff --git a/testcases/lib/tst_test.sh b/testcases/lib/tst_test.sh
index 28c2052d6..d7c9791e9 100644
--- a/testcases/lib/tst_test.sh
+++ b/testcases/lib/tst_test.sh
@@ -21,7 +21,7 @@ export TST_LIB_LOADED=1
 . tst_security.sh

 # default trap function
-trap "tst_brk TBROK 'test interrupted or timed out'" INT
+trap "tst_brk TBROK 'test interrupted'" INT

 _tst_do_exit()
 {
@@ -439,18 +439,18 @@ _tst_kill_test()
 {
        local i=10

-       trap '' INT
-       tst_res TBROK "Test timeouted, sending SIGINT! If you are
running on slow machine, try exporting LTP_TIMEOUT_MUL > 1"
-       kill -INT -$pid
+       trap '' TERM
+       tst_res TBROK "Test timeouted, sending SIGTERM! If you are
running on slow machine, try exporting LTP_TIMEOUT_MUL > 1"
+       kill -TERM -$pid
        tst_sleep 100ms

-       while kill -0 $pid 2>&1 > /dev/null && [ $i -gt 0 ]; do
+       while kill -0 $pid &>/dev/null && [ $i -gt 0 ]; do
                tst_res TINFO "Test is still running, waiting ${i}s"
                sleep 1
                i=$((i-1))
        done

-       if kill -0 $pid 2>&1 > /dev/null; then
+       if kill -0 $pid &>/dev/null; then
                tst_res TBROK "Test still running, sending SIGKILL"
                kill -KILL -$pid
        fi
Petr Vorel May 14, 2021, 7:53 a.m. UTC | #7
> > FYI I tried to use both SIGINT and SIGTERM, but there was some problem.
> > But I guess it was my error. Please *add* SIGTERM (keep SIGINT).

> Yes, we'd better keep SIGINT for ctrl^c action and use SIGTERM
> additionally for process terminating.

> Does this below (rough solution in my mind) work for you?
LGTM, but Joerg, Metan, could you please have a look?

> diff --git a/lib/newlib_tests/shell/timeout03.sh
> b/lib/newlib_tests/shell/timeout03.sh
> index cd548d9a2..f39f5712a 100755
> --- a/lib/newlib_tests/shell/timeout03.sh
> +++ b/lib/newlib_tests/shell/timeout03.sh
> @@ -30,6 +30,7 @@ TST_TIMEOUT=1

>  do_test()
>  {
> +       trap "echo 'Sorry, timeout03 is still alive'" TERM
Any reason why not use tst_res TINFO? (working on bash).
>         tst_res TINFO "testing killing test after TST_TIMEOUT"

>         sleep 2
> diff --git a/testcases/lib/tst_test.sh b/testcases/lib/tst_test.sh
> index 28c2052d6..d7c9791e9 100644
> --- a/testcases/lib/tst_test.sh
> +++ b/testcases/lib/tst_test.sh
> @@ -21,7 +21,7 @@ export TST_LIB_LOADED=1
>  . tst_security.sh

>  # default trap function
> -trap "tst_brk TBROK 'test interrupted or timed out'" INT
> +trap "tst_brk TBROK 'test interrupted'" INT

>  _tst_do_exit()
>  {
> @@ -439,18 +439,18 @@ _tst_kill_test()
>  {
>         local i=10

> -       trap '' INT
> -       tst_res TBROK "Test timeouted, sending SIGINT! If you are
> running on slow machine, try exporting LTP_TIMEOUT_MUL > 1"
> -       kill -INT -$pid
> +       trap '' TERM
> +       tst_res TBROK "Test timeouted, sending SIGTERM! If you are
> running on slow machine, try exporting LTP_TIMEOUT_MUL > 1"
> +       kill -TERM -$pid
>         tst_sleep 100ms

> -       while kill -0 $pid 2>&1 > /dev/null && [ $i -gt 0 ]; do
> +       while kill -0 $pid &>/dev/null && [ $i -gt 0 ]; do
FYI: &> is a bashism (we need to keep the original).
>                 tst_res TINFO "Test is still running, waiting ${i}s"
>                 sleep 1
>                 i=$((i-1))
>         done

> -       if kill -0 $pid 2>&1 > /dev/null; then
> +       if kill -0 $pid &>/dev/null; then
And here as well.

>                 tst_res TBROK "Test still running, sending SIGKILL"
>                 kill -KILL -$pid
>         fi

Kind regards,
Petr
Li Wang May 14, 2021, 9:10 a.m. UTC | #8
Petr Vorel <pvorel@suse.cz> wrote:

> > Does this below (rough solution in my mind) work for you?
> LGTM, but Joerg, Metan, could you please have a look?

Thanks, I wouldn't send a patch until Joerg/Cyril has a review.
(Maybe Joerg will have another better solution:)

> > diff --git a/lib/newlib_tests/shell/timeout03.sh
> > b/lib/newlib_tests/shell/timeout03.sh
> > index cd548d9a2..f39f5712a 100755
> > --- a/lib/newlib_tests/shell/timeout03.sh
> > +++ b/lib/newlib_tests/shell/timeout03.sh
> > @@ -30,6 +30,7 @@ TST_TIMEOUT=1
>
> >  do_test()
> >  {
> > +       trap "echo 'Sorry, timeout03 is still alive'" TERM
> Any reason why not use tst_res TINFO? (working on bash).

+1
I just use echo for a quick test, but surely we can replace it with tst_res.


> > -       while kill -0 $pid 2>&1 > /dev/null && [ $i -gt 0 ]; do
> > +       while kill -0 $pid &>/dev/null && [ $i -gt 0 ]; do
> FYI: &> is a bashism (we need to keep the original).

I just want the error does not to redirect to standard output.
Because with SIGTERM sending, it seems easier to kill all
processes, so 'kill -0 $pid' returns "No such process" errors often.

Mayby I should go with: kill -0 $pid >/dev/null 2>&1

e.g.

# ./timeout04.sh
timeout04 1 TINFO: timeout per run is 0h 0m 1s
timeout04 1 TINFO: Start
timeout04 1 TBROK: Test timeouted, sending SIGTERM! If you are running
on slow machine, try exporting LTP_TIMEOUT_MUL > 1
Terminated
./../../../testcases/lib/tst_test.sh: line 448: kill: (74911) - No such process
./../../../testcases/lib/tst_test.sh: line 454: kill: (74911) - No such process
Petr Vorel May 14, 2021, 3:10 p.m. UTC | #9
Hi Li, Joerg, Cyril,

> Petr Vorel <pvorel@suse.cz> wrote:

> > > Does this below (rough solution in my mind) work for you?
> > LGTM, but Joerg, Metan, could you please have a look?

> Thanks, I wouldn't send a patch until Joerg/Cyril has a review.
> (Maybe Joerg will have another better solution:)
Understand.

> > > diff --git a/lib/newlib_tests/shell/timeout03.sh
> > > b/lib/newlib_tests/shell/timeout03.sh
> > > index cd548d9a2..f39f5712a 100755
> > > --- a/lib/newlib_tests/shell/timeout03.sh
> > > +++ b/lib/newlib_tests/shell/timeout03.sh
> > > @@ -30,6 +30,7 @@ TST_TIMEOUT=1

> > >  do_test()
> > >  {
> > > +       trap "echo 'Sorry, timeout03 is still alive'" TERM
> > Any reason why not use tst_res TINFO? (working on bash).

> +1
> I just use echo for a quick test, but surely we can replace it with tst_res.
+1


> > > -       while kill -0 $pid 2>&1 > /dev/null && [ $i -gt 0 ]; do
> > > +       while kill -0 $pid &>/dev/null && [ $i -gt 0 ]; do
> > FYI: &> is a bashism (we need to keep the original).

> I just want the error does not to redirect to standard output.
> Because with SIGTERM sending, it seems easier to kill all
> processes, so 'kill -0 $pid' returns "No such process" errors often.

> Mayby I should go with: kill -0 $pid >/dev/null 2>&1
+1. Now I see it, "2>&1 > /dev/null" is wrong, it must be vice versa.

Kind regards,
Petr

> e.g.

> # ./timeout04.sh
> timeout04 1 TINFO: timeout per run is 0h 0m 1s
> timeout04 1 TINFO: Start
> timeout04 1 TBROK: Test timeouted, sending SIGTERM! If you are running
> on slow machine, try exporting LTP_TIMEOUT_MUL > 1
> Terminated
> ./../../../testcases/lib/tst_test.sh: line 448: kill: (74911) - No such process
> ./../../../testcases/lib/tst_test.sh: line 454: kill: (74911) - No such process
Joerg Vehlow May 18, 2021, 4:57 a.m. UTC | #10
Hi Li,

On 5/13/2021 7:08 AM, Li Wang wrote:
>> FYI I tried to use both SIGINT and SIGTERM, but there was some problem.
>> But I guess it was my error. Please *add* SIGTERM (keep SIGINT).
> Yes, we'd better keep SIGINT for ctrl^c action and use SIGTERM
> additionally for process terminating.
>
> Does this below (rough solution in my mind) work for you?
>
> diff --git a/lib/newlib_tests/shell/timeout03.sh
> b/lib/newlib_tests/shell/timeout03.sh
> index cd548d9a2..f39f5712a 100755
> --- a/lib/newlib_tests/shell/timeout03.sh
> +++ b/lib/newlib_tests/shell/timeout03.sh
> @@ -30,6 +30,7 @@ TST_TIMEOUT=1
>
>   do_test()
>   {
> +       trap "echo 'Sorry, timeout03 is still alive'" TERM
>          tst_res TINFO "testing killing test after TST_TIMEOUT"
>
>          sleep 2
> diff --git a/testcases/lib/tst_test.sh b/testcases/lib/tst_test.sh
> index 28c2052d6..d7c9791e9 100644
> --- a/testcases/lib/tst_test.sh
> +++ b/testcases/lib/tst_test.sh
> @@ -21,7 +21,7 @@ export TST_LIB_LOADED=1
>   . tst_security.sh
>
>   # default trap function
> -trap "tst_brk TBROK 'test interrupted or timed out'" INT
> +trap "tst_brk TBROK 'test interrupted'" INT
This would require something like
trap "tst_brk TBROK 'test terminated'" TERM
or
trap "_tst_do_exit" TERM

Otherwise the test is terminated very roughly, without executing 
cleanup, which is probably not a good idea.

But that introduces the next problem: A short deadlock between 
_tst_kill_test and _tst_cleanup_timer,
because _tst_cleanup_timer waits for the termination of the timeout 
process and vice versa.
Another problem is, that a SIGTERM originating from some other location 
could look like a timeout.

I am currently thinking about the following solution, to mitigate most 
problems:
The timeout process sends SIGUSR1 (or maybe SIGALRM?) only to the main 
test process and blocks TERM.
The main process can print, that it ran into a timeout, send a sigterm 
to its processs group (while ignoring TERM itself).
Then it can unset $_tst_setup_timer_pid safely, because it knows it was 
triggered by the timeout process and execute _tst_do_exit.

If the timeout process does not see the termination of the main process, 
it can still send SIGKILL to the whole process group.

Jörg
Li Wang May 18, 2021, 7:27 a.m. UTC | #11
Hi Joerg,

> > -trap "tst_brk TBROK 'test interrupted or timed out'" INT
> > +trap "tst_brk TBROK 'test interrupted'" INT
> This would require something like
> trap "tst_brk TBROK 'test terminated'" TERM
> or
> trap "_tst_do_exit" TERM
>
> Otherwise the test is terminated very roughly, without executing
> cleanup, which is probably not a good idea.

Yes, seems I didn't realize this needs cleanup as well.

But I'd still suggest keeping SIGINT here for catching Ctrl^C for users :).

>
> But that introduces the next problem: A short deadlock between
> _tst_kill_test and _tst_cleanup_timer,
> because _tst_cleanup_timer waits for the termination of the timeout
> process and vice versa.
> Another problem is, that a SIGTERM originating from some other location
> could look like a timeout.

Yes, and that's the reason why I didn't trap SIGTERM simply in the main process.


> I am currently thinking about the following solution, to mitigate most
> problems:
> The timeout process sends SIGUSR1 (or maybe SIGALRM?) only to the main
> test process and blocks TERM.
> The main process can print, that it ran into a timeout, send a sigterm
> to its processs group (while ignoring TERM itself).
> Then it can unset $_tst_setup_timer_pid safely, because it knows it was
> triggered by the timeout process and execute _tst_do_exit.
>
> If the timeout process does not see the termination of the main process,
> it can still send SIGKILL to the whole process group.


It probably will be work but looks a bit confusing since that involves
more signals.

In conclusion, I think we maybe have such situations to be solved:

1. SIGINT (Ctrl^C) for terminating the main process and do cleanup
correctly before a timeout
2. Test finish normally and retrieves the _tst_timeout_process in the
background via SIGTERM(sending by _tst_cleanup_timer)
3. Test timeout occurs and _tst_kill_test sending SIGTERM to
terminating all process, and the main process do cleanup work
4. Test timeout occurs but still have process alive after
_tst_kill_test sending SIGTERM, then sending SIGKILL to the whole
group

So, I'm now thinking can we just introduce a knob(variable) for skipping
the _tst_cleanup_timer works in timeout mode, then it will not have a
deadlock anymore.

How about:

--- a/testcases/lib/tst_test.sh
+++ b/testcases/lib/tst_test.sh
@@ -16,12 +16,14 @@ export TST_COUNT=1
 export TST_ITERATIONS=1
 export TST_TMPDIR_RHOST=0
 export TST_LIB_LOADED=1
+export TST_TIMEOUT_OCCUR=0

 . tst_ansi_color.sh
 . tst_security.sh

 # default trap function
-trap "tst_brk TBROK 'test interrupted or timed out'" INT
+trap "tst_brk TBROK 'test interrupted'" INT
+trap "TST_TIMEOUT_OCCUR=1; tst_brk TBROK 'test timeouted'" TERM

 _tst_do_exit()
 {
@@ -48,7 +50,9 @@ _tst_do_exit()
                [ "$TST_TMPDIR_RHOST" = 1 ] && tst_cleanup_rhost
        fi

-       _tst_cleanup_timer
+       if ["$TST_TIMEOUT_OCCUR" = 0 ]; then
+               _tst_cleanup_timer
+       fi

        if [ $TST_FAIL -gt 0 ]; then
                ret=$((ret|1))
@@ -439,18 +443,18 @@ _tst_kill_test()
 {
        local i=10

-       trap '' INT
-       tst_res TBROK "Test timeouted, sending SIGINT! If you are
running on slow machine, try exporting LTP_TIMEOUT_MUL > 1"
-       kill -INT -$pid
+       trap '' TERM
+       tst_res TBROK "Test timeouted, sending SIGTERM! If you are
running on slow machine, try exporting LTP_TIMEOUT_MUL > 1"
+       kill -TERM -$pid
        tst_sleep 100ms

-       while kill -0 $pid 2>&1 > /dev/null && [ $i -gt 0 ]; do
+       while kill -0 $pid >/dev/null 2>&1 && [ $i -gt 0 ]; do
                tst_res TINFO "Test is still running, waiting ${i}s"
                sleep 1
                i=$((i-1))
        done

-       if kill -0 $pid 2>&1 > /dev/null; then
+       if kill -0 $pid >/dev/null 2>&1; then
                tst_res TBROK "Test still running, sending SIGKILL"
                kill -KILL -$pid
        fi


--
Regards,
Li Wang
Petr Vorel May 18, 2021, 8:46 a.m. UTC | #12
Hi all,

> Hi Joerg,

> > > -trap "tst_brk TBROK 'test interrupted or timed out'" INT
> > > +trap "tst_brk TBROK 'test interrupted'" INT
> > This would require something like
> > trap "tst_brk TBROK 'test terminated'" TERM
> > or
> > trap "_tst_do_exit" TERM

> > Otherwise the test is terminated very roughly, without executing
> > cleanup, which is probably not a good idea.
+1

> Yes, seems I didn't realize this needs cleanup as well.

> But I'd still suggest keeping SIGINT here for catching Ctrl^C for users :).
+1


> > But that introduces the next problem: A short deadlock between
> > _tst_kill_test and _tst_cleanup_timer,
> > because _tst_cleanup_timer waits for the termination of the timeout
> > process and vice versa.
> > Another problem is, that a SIGTERM originating from some other location
> > could look like a timeout.

> Yes, and that's the reason why I didn't trap SIGTERM simply in the main process.


> > I am currently thinking about the following solution, to mitigate most
> > problems:
> > The timeout process sends SIGUSR1 (or maybe SIGALRM?) only to the main
> > test process and blocks TERM.
> > The main process can print, that it ran into a timeout, send a sigterm
> > to its processs group (while ignoring TERM itself).
> > Then it can unset $_tst_setup_timer_pid safely, because it knows it was
> > triggered by the timeout process and execute _tst_do_exit.

> > If the timeout process does not see the termination of the main process,
> > it can still send SIGKILL to the whole process group.


> It probably will be work but looks a bit confusing since that involves
> more signals.

> In conclusion, I think we maybe have such situations to be solved:

> 1. SIGINT (Ctrl^C) for terminating the main process and do cleanup
> correctly before a timeout
> 2. Test finish normally and retrieves the _tst_timeout_process in the
> background via SIGTERM(sending by _tst_cleanup_timer)
> 3. Test timeout occurs and _tst_kill_test sending SIGTERM to
> terminating all process, and the main process do cleanup work
> 4. Test timeout occurs but still have process alive after
> _tst_kill_test sending SIGTERM, then sending SIGKILL to the whole
> group

> So, I'm now thinking can we just introduce a knob(variable) for skipping
> the _tst_cleanup_timer works in timeout mode, then it will not have a
> deadlock anymore.

+1

> How about:

I'm not sure if we're not getting too late for these changes. Because it'll be
just on us to test that (community probably have run these changes). But I'd
still prefer to fix it.

If we don't fix it, I'd be at least for fixing wrong redirection order (2>&1 >
/dev/null) introduced by me in 25ad54dba.

> --- a/testcases/lib/tst_test.sh
> +++ b/testcases/lib/tst_test.sh
> @@ -16,12 +16,14 @@ export TST_COUNT=1
>  export TST_ITERATIONS=1
>  export TST_TMPDIR_RHOST=0
>  export TST_LIB_LOADED=1
> +export TST_TIMEOUT_OCCUR=0

>  . tst_ansi_color.sh
>  . tst_security.sh

>  # default trap function
> -trap "tst_brk TBROK 'test interrupted or timed out'" INT
> +trap "tst_brk TBROK 'test interrupted'" INT
> +trap "TST_TIMEOUT_OCCUR=1; tst_brk TBROK 'test timeouted'" TERM

>  _tst_do_exit()
>  {
> @@ -48,7 +50,9 @@ _tst_do_exit()
>                 [ "$TST_TMPDIR_RHOST" = 1 ] && tst_cleanup_rhost
>         fi

> -       _tst_cleanup_timer
> +       if ["$TST_TIMEOUT_OCCUR" = 0 ]; then
> +               _tst_cleanup_timer
> +       fi

>         if [ $TST_FAIL -gt 0 ]; then
>                 ret=$((ret|1))
> @@ -439,18 +443,18 @@ _tst_kill_test()
>  {
>         local i=10

> -       trap '' INT
> -       tst_res TBROK "Test timeouted, sending SIGINT! If you are
> running on slow machine, try exporting LTP_TIMEOUT_MUL > 1"
> -       kill -INT -$pid
> +       trap '' TERM
> +       tst_res TBROK "Test timeouted, sending SIGTERM! If you are
> running on slow machine, try exporting LTP_TIMEOUT_MUL > 1"
> +       kill -TERM -$pid
>         tst_sleep 100ms

> -       while kill -0 $pid 2>&1 > /dev/null && [ $i -gt 0 ]; do
> +       while kill -0 $pid >/dev/null 2>&1 && [ $i -gt 0 ]; do
>                 tst_res TINFO "Test is still running, waiting ${i}s"
>                 sleep 1
>                 i=$((i-1))
>         done

> -       if kill -0 $pid 2>&1 > /dev/null; then
> +       if kill -0 $pid >/dev/null 2>&1; then
>                 tst_res TBROK "Test still running, sending SIGKILL"
>                 kill -KILL -$pid
>         fi

LGTM, I'll try to test it today.

Kind regards,
Petr
Joerg Vehlow May 18, 2021, 9:45 a.m. UTC | #13
Hi,

On 5/18/2021 9:27 AM, Li Wang wrote:
> Hi Joerg,
>
>>> -trap "tst_brk TBROK 'test interrupted or timed out'" INT
>>> +trap "tst_brk TBROK 'test interrupted'" INT
>> This would require something like
>> trap "tst_brk TBROK 'test terminated'" TERM
>> or
>> trap "_tst_do_exit" TERM
>>
>> Otherwise the test is terminated very roughly, without executing
>> cleanup, which is probably not a good idea.
> Yes, seems I didn't realize this needs cleanup as well.
>
> But I'd still suggest keeping SIGINT here for catching Ctrl^C for users :).
+1, I never intended to remove thi
>> I am currently thinking about the following solution, to mitigate most
>> problems:
>> The timeout process sends SIGUSR1 (or maybe SIGALRM?) only to the main
>> test process and blocks TERM.
>> The main process can print, that it ran into a timeout, send a sigterm
>> to its processs group (while ignoring TERM itself).
>> Then it can unset $_tst_setup_timer_pid safely, because it knows it was
>> triggered by the timeout process and execute _tst_do_exit.
>>
>> If the timeout process does not see the termination of the main process,
>> it can still send SIGKILL to the whole process group.
>
> It probably will be work but looks a bit confusing since that involves
> more signals.
>
> In conclusion, I think we maybe have such situations to be solved:
>
> 1. SIGINT (Ctrl^C) for terminating the main process and do cleanup
> correctly before a timeout
> 2. Test finish normally and retrieves the _tst_timeout_process in the
> background via SIGTERM(sending by _tst_cleanup_timer)
> 3. Test timeout occurs and _tst_kill_test sending SIGTERM to
> terminating all process, and the main process do cleanup work
> 4. Test timeout occurs but still have process alive after
> _tst_kill_test sending SIGTERM, then sending SIGKILL to the whole
> group
>
> So, I'm now thinking can we just introduce a knob(variable) for skipping
> the _tst_cleanup_timer works in timeout mode, then it will not have a
> deadlock anymore.
This works of course and is the "simplest" solution, the only thing I do 
not like about this,
is the fact, that SIGTERM send by something else (e.g. system shoutdown 
or process manager),
is handled like timeouts are handled and reported as timeout. That's why 
I suggested introducing
a new signal. But since this is probably rare, I could live without it.


>
> How about:
>
> --- a/testcases/lib/tst_test.sh
> +++ b/testcases/lib/tst_test.sh
> @@ -16,12 +16,14 @@ export TST_COUNT=1
>   export TST_ITERATIONS=1
>   export TST_TMPDIR_RHOST=0
>   export TST_LIB_LOADED=1
> +export TST_TIMEOUT_OCCUR=0
>
>   . tst_ansi_color.sh
>   . tst_security.sh
>
>   # default trap function
> -trap "tst_brk TBROK 'test interrupted or timed out'" INT
> +trap "tst_brk TBROK 'test interrupted'" INT
> +trap "TST_TIMEOUT_OCCUR=1; tst_brk TBROK 'test timeouted'" TERM
This could also be done by "unset _tst_setup_timer_pid" or 
'_tst_setup_timer_pid=""'.
I guess even if a new variable is introduced, it should start with an _, 
because it is supposed to be internal to the framework?


>
>   _tst_do_exit()
>   {
> @@ -48,7 +50,9 @@ _tst_do_exit()
>                  [ "$TST_TMPDIR_RHOST" = 1 ] && tst_cleanup_rhost
>          fi
>
> -       _tst_cleanup_timer
> +       if ["$TST_TIMEOUT_OCCUR" = 0 ]; then
> +               _tst_cleanup_timer
> +       fi
>
>          if [ $TST_FAIL -gt 0 ]; then
>                  ret=$((ret|1))
> @@ -439,18 +443,18 @@ _tst_kill_test()
>   {
>          local i=10
>
> -       trap '' INT
> -       tst_res TBROK "Test timeouted, sending SIGINT! If you are
> running on slow machine, try exporting LTP_TIMEOUT_MUL > 1"
> -       kill -INT -$pid
> +       trap '' TERM
> +       tst_res TBROK "Test timeouted, sending SIGTERM! If you are
> running on slow machine, try exporting LTP_TIMEOUT_MUL > 1"
If you post this as a patch, can you please fix "timeouted" => "timed out"?
There is no word "timeouted" in the english language.

> +       kill -TERM -$pid
>          tst_sleep 100ms
>
> -       while kill -0 $pid 2>&1 > /dev/null && [ $i -gt 0 ]; do
> +       while kill -0 $pid >/dev/null 2>&1 && [ $i -gt 0 ]; do
>                  tst_res TINFO "Test is still running, waiting ${i}s"
>                  sleep 1
>                  i=$((i-1))
>          done
>
> -       if kill -0 $pid 2>&1 > /dev/null; then
> +       if kill -0 $pid >/dev/null 2>&1; then
>                  tst_res TBROK "Test still running, sending SIGKILL"
>                  kill -KILL -$pid
>          fi
>
>
> --
> Regards,
> Li Wang

@Petr
I wouldn't recommend getting the fix into the release.
The problem is nothing new and does not fix a "real issue" at the moment,
but has the risk of introducing something unexpected.
Fixing the output redirection could be done without a major risk, I guess.

Jörg
Li Wang May 18, 2021, 12:01 p.m. UTC | #14
Hi all,

> > In conclusion, I think we maybe have such situations to be solved:
> >
> > 1. SIGINT (Ctrl^C) for terminating the main process and do cleanup
> > correctly before a timeout
> > 2. Test finish normally and retrieves the _tst_timeout_process in the
> > background via SIGTERM(sending by _tst_cleanup_timer)
> > 3. Test timeout occurs and _tst_kill_test sending SIGTERM to
> > terminating all process, and the main process do cleanup work
> > 4. Test timeout occurs but still have process alive after
> > _tst_kill_test sending SIGTERM, then sending SIGKILL to the whole
> > group
> >
> > So, I'm now thinking can we just introduce a knob(variable) for skipping
> > the _tst_cleanup_timer works in timeout mode, then it will not have a
> > deadlock anymore.
> This works of course and is the "simplest" solution, the only thing I do
> not like about this,
> is the fact, that SIGTERM send by something else (e.g. system shoutdown
> or process manager),
> is handled like timeouts are handled and reported as timeout. That's why
> I suggested introducing
> a new signal. But since this is probably rare, I could live without it.

Hmm, it wouldn't handle/report like a time-out if we break with "test
terminated"
output for a SIGTERM. If we do

     trap "unset _tst_setup_timer_pid; tst_brk TBROK 'test terminated'" TERM

in the main process, system will still send SIGTERM to the _tst_timeout_process
when shutting down, and the _tst_kill_test will never be called in that case.

>
>
> >
> > How about:
> >
> > --- a/testcases/lib/tst_test.sh
> > +++ b/testcases/lib/tst_test.sh
> > @@ -16,12 +16,14 @@ export TST_COUNT=1
> >   export TST_ITERATIONS=1
> >   export TST_TMPDIR_RHOST=0
> >   export TST_LIB_LOADED=1
> > +export TST_TIMEOUT_OCCUR=0
> >
> >   . tst_ansi_color.sh
> >   . tst_security.sh
> >
> >   # default trap function
> > -trap "tst_brk TBROK 'test interrupted or timed out'" INT
> > +trap "tst_brk TBROK 'test interrupted'" INT
> > +trap "TST_TIMEOUT_OCCUR=1; tst_brk TBROK 'test timeouted'" TERM
> This could also be done by "unset _tst_setup_timer_pid" or
> '_tst_setup_timer_pid=""'.

+1, 'unset _tst_setup_timer_pid' is a good idea, sorry I was blind
here when reading your previous email:).

> I guess even if a new variable is introduced, it should start with an _,
> because it is supposed to be internal to the framework?

Yes, but let's go with "unset _tst_setup_timer_pid" but not introduce
a new variable.

>
>
> >
> >   _tst_do_exit()
> >   {
> > @@ -48,7 +50,9 @@ _tst_do_exit()
> >                  [ "$TST_TMPDIR_RHOST" = 1 ] && tst_cleanup_rhost
> >          fi
> >
> > -       _tst_cleanup_timer
> > +       if ["$TST_TIMEOUT_OCCUR" = 0 ]; then
> > +               _tst_cleanup_timer
> > +       fi
> >
> >          if [ $TST_FAIL -gt 0 ]; then
> >                  ret=$((ret|1))
> > @@ -439,18 +443,18 @@ _tst_kill_test()
> >   {
> >          local i=10
> >
> > -       trap '' INT
> > -       tst_res TBROK "Test timeouted, sending SIGINT! If you are
> > running on slow machine, try exporting LTP_TIMEOUT_MUL > 1"
> > -       kill -INT -$pid
> > +       trap '' TERM
> > +       tst_res TBROK "Test timeouted, sending SIGTERM! If you are
> > running on slow machine, try exporting LTP_TIMEOUT_MUL > 1"
> If you post this as a patch, can you please fix "timeouted" => "timed out"?
> There is no word "timeouted" in the english language.

Sure. Thanks for your strict attitude on syntax.


> @Petr
> I wouldn't recommend getting the fix into the release.
> The problem is nothing new and does not fix a "real issue" at the moment,
> but has the risk of introducing something unexpected.
> Fixing the output redirection could be done without a major risk, I guess.

I will split the fix into two-part, one for errors redirection,
another for SIGTERM using.

Thanks for your review!


--
Regards,
Li Wang
diff mbox series

Patch

diff --git a/testcases/lib/tst_test.sh b/testcases/lib/tst_test.sh
index ed2699175..b6ca0cb26 100644
--- a/testcases/lib/tst_test.sh
+++ b/testcases/lib/tst_test.sh
@@ -439,6 +439,7 @@  _tst_kill_test()
 {
 	local i=10
 
+	trap '' INT
 	tst_res TBROK "Test timeouted, sending SIGINT! If you are running on slow machine, try exporting LTP_TIMEOUT_MUL > 1"
 	kill -INT -$pid
 	tst_sleep 100ms