Message ID | 20210508055109.16914-4-liwang@redhat.com |
---|---|
State | Accepted |
Headers | show |
Series | shell: Fix orphan timeout sleep processes | expand |
Hi!
Reviewed-by: Cyril Hrubis <chrubis@suse.cz>
Hi Li, first of all thanks for fixing my patchset and getting it merged. On 5/8/2021 7:51 AM, Li Wang wrote: > We have to guarantee _tst_kill_test alive for a while to check if > the target test eixst or not, so ignore SIGINT in _tst_kill_test > is necessary, otherwise it will be stopped by the SIGINT sending > by itself. > > The timeout03.sh verify this mechanism proccess well in output: > > timeout03 1 TBROK: Test timeouted, sending SIGINT! If you are running on slow machine, try exporting LTP_TIMEOUT_MUL > 1 > timeout03 1 TBROK: test interrupted or timed out > timeout03 1 TPASS: test run cleanup after timeout > timeout03 1 TINFO: Test is still running, waiting 10s > timeout03 1 TINFO: Test is still running, waiting 9s > timeout03 1 TINFO: Test is still running, waiting 8s > timeout03 1 TINFO: Test is still running, waiting 7s > timeout03 1 TINFO: Test is still running, waiting 6s > timeout03 1 TINFO: Test is still running, waiting 5s > timeout03 1 TINFO: Test is still running, waiting 4s > timeout03 1 TINFO: Test is still running, waiting 3s > timeout03 1 TINFO: Test is still running, waiting 2s > timeout03 1 TINFO: Test is still running, waiting 1s > timeout03 1 TBROK: Test still running, sending SIGKILL > Killed At first I did bot understand the problem you found, because I tried with dash, busybox sh and zsh. All three shells had no problem here. But then I tried with bash and it failed. I wonder if this is a bug in bash or in the other shells. I guess sending the signal to the whole process group should also send it to the process running _tst_kill_test. I did some digging into this while writing this (see conclusion below for results only): 1. All shells have their own implementation of kill (compare <SHELL> -c kill with /usr/bin kill) 2. When replacing "just" kill in the script with /usr/bin/kill, it still only fails in bash. 3. zsh seems to ignore SIGINT, but it can be caught using trap. busybox sh, and dash can't even get it when trapped 4. zsh disables SIGINT by callling "trap '' INT" internally somehow. When resetting SIGINT to default behavior, it is the same as bash. For zsh this seems to be default behavior for background processes, probably to prevent keyboard interruption by CTRL+C: zsh -c "trap&" trap -- '' INT trap -- '' QUIT zsh -c "trap" # No output To conclude: - bash does not seem to care about SIGINT delivery to background processes, but can be blocked using trap - zsh ignores SIGINT for background processes by default, but can be allowed using trap - dash and busybox sh ignore the signal to background processes, and this cannot be changed with trap I tried with the following snippets: <SHELL> -c 'trap "echo trap;" INT; (sleep 2; echo end sub) & sleep 1; kill -INT -$$; echo end main' <SHELL> -c 'trap "echo trap;" INT; (trap - SIGINT sleep 2; echo end sub) & sleep 1; kill -INT -$$; echo end main' <SHELL> -c 'trap "echo trap;" INT; (trap "exit" SIGINT sleep 2; echo end sub) & sleep 1; kill -INT -$$; echo end main' SIGINT handling for child processes is strange. This might have some implication for the shell tests, because it is possible, that SIGINT is not delivered to all processes and some may reside as orphans. Since this can happen only in case of timeouts, I guess there is no real Problem. A possible fix could be using SIGTERM instead of SIGINT. This signal does not seem to have some "intelligent" handling for background processes. I do not know why LTP used SIGINT in the first place. My first thought would have been to use SIGTERM. It is the way to "politely ask processes to terminate" Jörg
Hi Joerg, On Tue, May 11, 2021 at 1:52 PM Joerg Vehlow <lkml@jv-coder.de> wrote: > > Hi Li, > > first of all thanks for fixing my patchset and getting it merged. > > On 5/8/2021 7:51 AM, Li Wang wrote: > > We have to guarantee _tst_kill_test alive for a while to check if > > the target test eixst or not, so ignore SIGINT in _tst_kill_test > > is necessary, otherwise it will be stopped by the SIGINT sending > > by itself. > > > > The timeout03.sh verify this mechanism proccess well in output: > > > > timeout03 1 TBROK: Test timeouted, sending SIGINT! If you are running on slow machine, try exporting LTP_TIMEOUT_MUL > 1 > > timeout03 1 TBROK: test interrupted or timed out > > timeout03 1 TPASS: test run cleanup after timeout > > timeout03 1 TINFO: Test is still running, waiting 10s > > timeout03 1 TINFO: Test is still running, waiting 9s > > timeout03 1 TINFO: Test is still running, waiting 8s > > timeout03 1 TINFO: Test is still running, waiting 7s > > timeout03 1 TINFO: Test is still running, waiting 6s > > timeout03 1 TINFO: Test is still running, waiting 5s > > timeout03 1 TINFO: Test is still running, waiting 4s > > timeout03 1 TINFO: Test is still running, waiting 3s > > timeout03 1 TINFO: Test is still running, waiting 2s > > timeout03 1 TINFO: Test is still running, waiting 1s > > timeout03 1 TBROK: Test still running, sending SIGKILL > > Killed > At first I did bot understand the problem you found, because I tried > with dash, busybox sh and zsh. > All three shells had no problem here. But then I tried with bash and it > failed. > > I wonder if this is a bug in bash or in the other shells. I guess > sending the signal to the whole > process group should also send it to the process running _tst_kill_test. > > I did some digging into this while writing this (see conclusion below > for results only): > 1. All shells have their own implementation of kill (compare <SHELL> -c > kill with /usr/bin kill) > 2. When replacing "just" kill in the script with /usr/bin/kill, it still > only fails in bash. > 3. zsh seems to ignore SIGINT, but it can be caught using trap. busybox > sh, and dash can't even get it when trapped > 4. zsh disables SIGINT by callling "trap '' INT" internally somehow. > When resetting SIGINT to default behavior, it is the same as bash. > > For zsh this seems to be default behavior for background processes, > probably to prevent keyboard interruption by CTRL+C: > zsh -c "trap&" > trap -- '' INT > trap -- '' QUIT > > zsh -c "trap" > # No output > > > > To conclude: > - bash does not seem to care about SIGINT delivery to background > processes, but can be blocked using trap > - zsh ignores SIGINT for background processes by default, but can be > allowed using trap > - dash and busybox sh ignore the signal to background processes, and > this cannot be changed with trap > > I tried with the following snippets: > <SHELL> -c 'trap "echo trap;" INT; (sleep 2; echo end sub) & sleep 1; > kill -INT -$$; echo end main' > <SHELL> -c 'trap "echo trap;" INT; (trap - SIGINT sleep 2; echo end sub) > & sleep 1; kill -INT -$$; echo end main' > <SHELL> -c 'trap "echo trap;" INT; (trap "exit" SIGINT sleep 2; echo end > sub) & sleep 1; kill -INT -$$; echo end main' > Thanks for the demos above, it shows the difference clearly. > SIGINT handling for child processes is strange. This might have some > implication for the shell tests, > because it is possible, that SIGINT is not delivered to all processes > and some may reside as orphans. > Since this can happen only in case of timeouts, I guess there is no real > Problem. Yes. Looks like the behaviors on signal 'SIGINT' are not unify in background processes handling for different SHELL. So as you said that using SIGINT seems NOT a good idea to stop the process in timeout. > > A possible fix could be using SIGTERM instead of SIGINT. This signal > does not seem to have some "intelligent" handling for background processes. I agree. Can you make a patch to replace that INT? (and this is only a timeout issue, so patch merging may be delayed due to LTP new release) > I do not know why LTP used SIGINT in the first place. My first thought > would have been to use SIGTERM. It is the way to "politely ask > processes to terminate" Yes, but that not strange to me, the possible reason is just to stop(ctrl ^c) the LTP test manually for debugging, so we went too far for using SIGINT but forget the original purpose :).
Hi Li, >> A possible fix could be using SIGTERM instead of SIGINT. This signal >> does not seem to have some "intelligent" handling for background processes. > I agree. Can you make a patch to replace that INT? > > (and this is only a timeout issue, so patch merging may be delayed due > to LTP new release) I'd like to supply the patch, I've placed it on my todo list. I will probably not finish it before the release, but since it will probably not be included anyway, it doesn't matter. I do not know why LTP used SIGINT in the first place. My first thought >> would have been to use SIGTERM. It is the way to "politely ask >> processes to terminate" > Yes, but that not strange to me, the possible reason is just to > stop(ctrl ^c) the LTP test manually for debugging, so we went > too far for using SIGINT but forget the original purpose :). Ok sounds reasonable. The nice think about changing timeout to SIGTERM would be, that abort using CTRL+C is clearly distinguishable from a timeout. Jörg
Hi all, ... > > To conclude: > > - bash does not seem to care about SIGINT delivery to background > > processes, but can be blocked using trap > > - zsh ignores SIGINT for background processes by default, but can be > > allowed using trap > > - dash and busybox sh ignore the signal to background processes, and > > this cannot be changed with trap > > I tried with the following snippets: > > <SHELL> -c 'trap "echo trap;" INT; (sleep 2; echo end sub) & sleep 1; > > kill -INT -$$; echo end main' > > <SHELL> -c 'trap "echo trap;" INT; (trap - SIGINT sleep 2; echo end sub) > > & sleep 1; kill -INT -$$; echo end main' > > <SHELL> -c 'trap "echo trap;" INT; (trap "exit" SIGINT sleep 2; echo end > > sub) & sleep 1; kill -INT -$$; echo end main' FYI (you probably know it) SIGINT is a bashism, INT needs to be used. $ kill -s SIGINT $$ dash: 2: kill: invalid signal number or name: SIGINT > Thanks for the demos above, it shows the difference clearly. > > SIGINT handling for child processes is strange. This might have some > > implication for the shell tests, > > because it is possible, that SIGINT is not delivered to all processes > > and some may reside as orphans. > > Since this can happen only in case of timeouts, I guess there is no real > > Problem. > Yes. > Looks like the behaviors on signal 'SIGINT' are not unify in background > processes handling for different SHELL. So as you said that using SIGINT > seems NOT a good idea to stop the process in timeout. Yes, trap looks to be having some portability issues [1] [2] > > A possible fix could be using SIGTERM instead of SIGINT. This signal > > does not seem to have some "intelligent" handling for background processes. > I agree. Can you make a patch to replace that INT? > (and this is only a timeout issue, so patch merging may be delayed due > to LTP new release) > > I do not know why LTP used SIGINT in the first place. My first thought > > would have been to use SIGTERM. It is the way to "politely ask > > processes to terminate" > Yes, but that not strange to me, the possible reason is just to > stop(ctrl ^c) the LTP test manually for debugging, so we went > too far for using SIGINT but forget the original purpose :). I'm not the author, but yes, SIGINT was used for stopping with ctrl^c during debugging. FYI I tried to use both SIGINT and SIGTERM, but there was some problem. But I guess it was my error. Please *add* SIGTERM (keep SIGINT). Kind regards, Petr [1] https://unix.stackexchange.com/questions/240723/exit-trap-in-dash-vs-ksh-and-bash/240736#240736 [2] https://stackoverflow.com/questions/27012762/is-trap-exit-required-to-execute-in-case-of-sigint-or-sigterm-received
> FYI I tried to use both SIGINT and SIGTERM, but there was some problem. > But I guess it was my error. Please *add* SIGTERM (keep SIGINT). Yes, we'd better keep SIGINT for ctrl^c action and use SIGTERM additionally for process terminating. Does this below (rough solution in my mind) work for you? diff --git a/lib/newlib_tests/shell/timeout03.sh b/lib/newlib_tests/shell/timeout03.sh index cd548d9a2..f39f5712a 100755 --- a/lib/newlib_tests/shell/timeout03.sh +++ b/lib/newlib_tests/shell/timeout03.sh @@ -30,6 +30,7 @@ TST_TIMEOUT=1 do_test() { + trap "echo 'Sorry, timeout03 is still alive'" TERM tst_res TINFO "testing killing test after TST_TIMEOUT" sleep 2 diff --git a/testcases/lib/tst_test.sh b/testcases/lib/tst_test.sh index 28c2052d6..d7c9791e9 100644 --- a/testcases/lib/tst_test.sh +++ b/testcases/lib/tst_test.sh @@ -21,7 +21,7 @@ export TST_LIB_LOADED=1 . tst_security.sh # default trap function -trap "tst_brk TBROK 'test interrupted or timed out'" INT +trap "tst_brk TBROK 'test interrupted'" INT _tst_do_exit() { @@ -439,18 +439,18 @@ _tst_kill_test() { local i=10 - trap '' INT - tst_res TBROK "Test timeouted, sending SIGINT! If you are running on slow machine, try exporting LTP_TIMEOUT_MUL > 1" - kill -INT -$pid + trap '' TERM + tst_res TBROK "Test timeouted, sending SIGTERM! If you are running on slow machine, try exporting LTP_TIMEOUT_MUL > 1" + kill -TERM -$pid tst_sleep 100ms - while kill -0 $pid 2>&1 > /dev/null && [ $i -gt 0 ]; do + while kill -0 $pid &>/dev/null && [ $i -gt 0 ]; do tst_res TINFO "Test is still running, waiting ${i}s" sleep 1 i=$((i-1)) done - if kill -0 $pid 2>&1 > /dev/null; then + if kill -0 $pid &>/dev/null; then tst_res TBROK "Test still running, sending SIGKILL" kill -KILL -$pid fi
> > FYI I tried to use both SIGINT and SIGTERM, but there was some problem. > > But I guess it was my error. Please *add* SIGTERM (keep SIGINT). > Yes, we'd better keep SIGINT for ctrl^c action and use SIGTERM > additionally for process terminating. > Does this below (rough solution in my mind) work for you? LGTM, but Joerg, Metan, could you please have a look? > diff --git a/lib/newlib_tests/shell/timeout03.sh > b/lib/newlib_tests/shell/timeout03.sh > index cd548d9a2..f39f5712a 100755 > --- a/lib/newlib_tests/shell/timeout03.sh > +++ b/lib/newlib_tests/shell/timeout03.sh > @@ -30,6 +30,7 @@ TST_TIMEOUT=1 > do_test() > { > + trap "echo 'Sorry, timeout03 is still alive'" TERM Any reason why not use tst_res TINFO? (working on bash). > tst_res TINFO "testing killing test after TST_TIMEOUT" > sleep 2 > diff --git a/testcases/lib/tst_test.sh b/testcases/lib/tst_test.sh > index 28c2052d6..d7c9791e9 100644 > --- a/testcases/lib/tst_test.sh > +++ b/testcases/lib/tst_test.sh > @@ -21,7 +21,7 @@ export TST_LIB_LOADED=1 > . tst_security.sh > # default trap function > -trap "tst_brk TBROK 'test interrupted or timed out'" INT > +trap "tst_brk TBROK 'test interrupted'" INT > _tst_do_exit() > { > @@ -439,18 +439,18 @@ _tst_kill_test() > { > local i=10 > - trap '' INT > - tst_res TBROK "Test timeouted, sending SIGINT! If you are > running on slow machine, try exporting LTP_TIMEOUT_MUL > 1" > - kill -INT -$pid > + trap '' TERM > + tst_res TBROK "Test timeouted, sending SIGTERM! If you are > running on slow machine, try exporting LTP_TIMEOUT_MUL > 1" > + kill -TERM -$pid > tst_sleep 100ms > - while kill -0 $pid 2>&1 > /dev/null && [ $i -gt 0 ]; do > + while kill -0 $pid &>/dev/null && [ $i -gt 0 ]; do FYI: &> is a bashism (we need to keep the original). > tst_res TINFO "Test is still running, waiting ${i}s" > sleep 1 > i=$((i-1)) > done > - if kill -0 $pid 2>&1 > /dev/null; then > + if kill -0 $pid &>/dev/null; then And here as well. > tst_res TBROK "Test still running, sending SIGKILL" > kill -KILL -$pid > fi Kind regards, Petr
Petr Vorel <pvorel@suse.cz> wrote: > > Does this below (rough solution in my mind) work for you? > LGTM, but Joerg, Metan, could you please have a look? Thanks, I wouldn't send a patch until Joerg/Cyril has a review. (Maybe Joerg will have another better solution:) > > diff --git a/lib/newlib_tests/shell/timeout03.sh > > b/lib/newlib_tests/shell/timeout03.sh > > index cd548d9a2..f39f5712a 100755 > > --- a/lib/newlib_tests/shell/timeout03.sh > > +++ b/lib/newlib_tests/shell/timeout03.sh > > @@ -30,6 +30,7 @@ TST_TIMEOUT=1 > > > do_test() > > { > > + trap "echo 'Sorry, timeout03 is still alive'" TERM > Any reason why not use tst_res TINFO? (working on bash). +1 I just use echo for a quick test, but surely we can replace it with tst_res. > > - while kill -0 $pid 2>&1 > /dev/null && [ $i -gt 0 ]; do > > + while kill -0 $pid &>/dev/null && [ $i -gt 0 ]; do > FYI: &> is a bashism (we need to keep the original). I just want the error does not to redirect to standard output. Because with SIGTERM sending, it seems easier to kill all processes, so 'kill -0 $pid' returns "No such process" errors often. Mayby I should go with: kill -0 $pid >/dev/null 2>&1 e.g. # ./timeout04.sh timeout04 1 TINFO: timeout per run is 0h 0m 1s timeout04 1 TINFO: Start timeout04 1 TBROK: Test timeouted, sending SIGTERM! If you are running on slow machine, try exporting LTP_TIMEOUT_MUL > 1 Terminated ./../../../testcases/lib/tst_test.sh: line 448: kill: (74911) - No such process ./../../../testcases/lib/tst_test.sh: line 454: kill: (74911) - No such process
Hi Li, Joerg, Cyril, > Petr Vorel <pvorel@suse.cz> wrote: > > > Does this below (rough solution in my mind) work for you? > > LGTM, but Joerg, Metan, could you please have a look? > Thanks, I wouldn't send a patch until Joerg/Cyril has a review. > (Maybe Joerg will have another better solution:) Understand. > > > diff --git a/lib/newlib_tests/shell/timeout03.sh > > > b/lib/newlib_tests/shell/timeout03.sh > > > index cd548d9a2..f39f5712a 100755 > > > --- a/lib/newlib_tests/shell/timeout03.sh > > > +++ b/lib/newlib_tests/shell/timeout03.sh > > > @@ -30,6 +30,7 @@ TST_TIMEOUT=1 > > > do_test() > > > { > > > + trap "echo 'Sorry, timeout03 is still alive'" TERM > > Any reason why not use tst_res TINFO? (working on bash). > +1 > I just use echo for a quick test, but surely we can replace it with tst_res. +1 > > > - while kill -0 $pid 2>&1 > /dev/null && [ $i -gt 0 ]; do > > > + while kill -0 $pid &>/dev/null && [ $i -gt 0 ]; do > > FYI: &> is a bashism (we need to keep the original). > I just want the error does not to redirect to standard output. > Because with SIGTERM sending, it seems easier to kill all > processes, so 'kill -0 $pid' returns "No such process" errors often. > Mayby I should go with: kill -0 $pid >/dev/null 2>&1 +1. Now I see it, "2>&1 > /dev/null" is wrong, it must be vice versa. Kind regards, Petr > e.g. > # ./timeout04.sh > timeout04 1 TINFO: timeout per run is 0h 0m 1s > timeout04 1 TINFO: Start > timeout04 1 TBROK: Test timeouted, sending SIGTERM! If you are running > on slow machine, try exporting LTP_TIMEOUT_MUL > 1 > Terminated > ./../../../testcases/lib/tst_test.sh: line 448: kill: (74911) - No such process > ./../../../testcases/lib/tst_test.sh: line 454: kill: (74911) - No such process
Hi Li, On 5/13/2021 7:08 AM, Li Wang wrote: >> FYI I tried to use both SIGINT and SIGTERM, but there was some problem. >> But I guess it was my error. Please *add* SIGTERM (keep SIGINT). > Yes, we'd better keep SIGINT for ctrl^c action and use SIGTERM > additionally for process terminating. > > Does this below (rough solution in my mind) work for you? > > diff --git a/lib/newlib_tests/shell/timeout03.sh > b/lib/newlib_tests/shell/timeout03.sh > index cd548d9a2..f39f5712a 100755 > --- a/lib/newlib_tests/shell/timeout03.sh > +++ b/lib/newlib_tests/shell/timeout03.sh > @@ -30,6 +30,7 @@ TST_TIMEOUT=1 > > do_test() > { > + trap "echo 'Sorry, timeout03 is still alive'" TERM > tst_res TINFO "testing killing test after TST_TIMEOUT" > > sleep 2 > diff --git a/testcases/lib/tst_test.sh b/testcases/lib/tst_test.sh > index 28c2052d6..d7c9791e9 100644 > --- a/testcases/lib/tst_test.sh > +++ b/testcases/lib/tst_test.sh > @@ -21,7 +21,7 @@ export TST_LIB_LOADED=1 > . tst_security.sh > > # default trap function > -trap "tst_brk TBROK 'test interrupted or timed out'" INT > +trap "tst_brk TBROK 'test interrupted'" INT This would require something like trap "tst_brk TBROK 'test terminated'" TERM or trap "_tst_do_exit" TERM Otherwise the test is terminated very roughly, without executing cleanup, which is probably not a good idea. But that introduces the next problem: A short deadlock between _tst_kill_test and _tst_cleanup_timer, because _tst_cleanup_timer waits for the termination of the timeout process and vice versa. Another problem is, that a SIGTERM originating from some other location could look like a timeout. I am currently thinking about the following solution, to mitigate most problems: The timeout process sends SIGUSR1 (or maybe SIGALRM?) only to the main test process and blocks TERM. The main process can print, that it ran into a timeout, send a sigterm to its processs group (while ignoring TERM itself). Then it can unset $_tst_setup_timer_pid safely, because it knows it was triggered by the timeout process and execute _tst_do_exit. If the timeout process does not see the termination of the main process, it can still send SIGKILL to the whole process group. Jörg
Hi Joerg, > > -trap "tst_brk TBROK 'test interrupted or timed out'" INT > > +trap "tst_brk TBROK 'test interrupted'" INT > This would require something like > trap "tst_brk TBROK 'test terminated'" TERM > or > trap "_tst_do_exit" TERM > > Otherwise the test is terminated very roughly, without executing > cleanup, which is probably not a good idea. Yes, seems I didn't realize this needs cleanup as well. But I'd still suggest keeping SIGINT here for catching Ctrl^C for users :). > > But that introduces the next problem: A short deadlock between > _tst_kill_test and _tst_cleanup_timer, > because _tst_cleanup_timer waits for the termination of the timeout > process and vice versa. > Another problem is, that a SIGTERM originating from some other location > could look like a timeout. Yes, and that's the reason why I didn't trap SIGTERM simply in the main process. > I am currently thinking about the following solution, to mitigate most > problems: > The timeout process sends SIGUSR1 (or maybe SIGALRM?) only to the main > test process and blocks TERM. > The main process can print, that it ran into a timeout, send a sigterm > to its processs group (while ignoring TERM itself). > Then it can unset $_tst_setup_timer_pid safely, because it knows it was > triggered by the timeout process and execute _tst_do_exit. > > If the timeout process does not see the termination of the main process, > it can still send SIGKILL to the whole process group. It probably will be work but looks a bit confusing since that involves more signals. In conclusion, I think we maybe have such situations to be solved: 1. SIGINT (Ctrl^C) for terminating the main process and do cleanup correctly before a timeout 2. Test finish normally and retrieves the _tst_timeout_process in the background via SIGTERM(sending by _tst_cleanup_timer) 3. Test timeout occurs and _tst_kill_test sending SIGTERM to terminating all process, and the main process do cleanup work 4. Test timeout occurs but still have process alive after _tst_kill_test sending SIGTERM, then sending SIGKILL to the whole group So, I'm now thinking can we just introduce a knob(variable) for skipping the _tst_cleanup_timer works in timeout mode, then it will not have a deadlock anymore. How about: --- a/testcases/lib/tst_test.sh +++ b/testcases/lib/tst_test.sh @@ -16,12 +16,14 @@ export TST_COUNT=1 export TST_ITERATIONS=1 export TST_TMPDIR_RHOST=0 export TST_LIB_LOADED=1 +export TST_TIMEOUT_OCCUR=0 . tst_ansi_color.sh . tst_security.sh # default trap function -trap "tst_brk TBROK 'test interrupted or timed out'" INT +trap "tst_brk TBROK 'test interrupted'" INT +trap "TST_TIMEOUT_OCCUR=1; tst_brk TBROK 'test timeouted'" TERM _tst_do_exit() { @@ -48,7 +50,9 @@ _tst_do_exit() [ "$TST_TMPDIR_RHOST" = 1 ] && tst_cleanup_rhost fi - _tst_cleanup_timer + if ["$TST_TIMEOUT_OCCUR" = 0 ]; then + _tst_cleanup_timer + fi if [ $TST_FAIL -gt 0 ]; then ret=$((ret|1)) @@ -439,18 +443,18 @@ _tst_kill_test() { local i=10 - trap '' INT - tst_res TBROK "Test timeouted, sending SIGINT! If you are running on slow machine, try exporting LTP_TIMEOUT_MUL > 1" - kill -INT -$pid + trap '' TERM + tst_res TBROK "Test timeouted, sending SIGTERM! If you are running on slow machine, try exporting LTP_TIMEOUT_MUL > 1" + kill -TERM -$pid tst_sleep 100ms - while kill -0 $pid 2>&1 > /dev/null && [ $i -gt 0 ]; do + while kill -0 $pid >/dev/null 2>&1 && [ $i -gt 0 ]; do tst_res TINFO "Test is still running, waiting ${i}s" sleep 1 i=$((i-1)) done - if kill -0 $pid 2>&1 > /dev/null; then + if kill -0 $pid >/dev/null 2>&1; then tst_res TBROK "Test still running, sending SIGKILL" kill -KILL -$pid fi -- Regards, Li Wang
Hi all, > Hi Joerg, > > > -trap "tst_brk TBROK 'test interrupted or timed out'" INT > > > +trap "tst_brk TBROK 'test interrupted'" INT > > This would require something like > > trap "tst_brk TBROK 'test terminated'" TERM > > or > > trap "_tst_do_exit" TERM > > Otherwise the test is terminated very roughly, without executing > > cleanup, which is probably not a good idea. +1 > Yes, seems I didn't realize this needs cleanup as well. > But I'd still suggest keeping SIGINT here for catching Ctrl^C for users :). +1 > > But that introduces the next problem: A short deadlock between > > _tst_kill_test and _tst_cleanup_timer, > > because _tst_cleanup_timer waits for the termination of the timeout > > process and vice versa. > > Another problem is, that a SIGTERM originating from some other location > > could look like a timeout. > Yes, and that's the reason why I didn't trap SIGTERM simply in the main process. > > I am currently thinking about the following solution, to mitigate most > > problems: > > The timeout process sends SIGUSR1 (or maybe SIGALRM?) only to the main > > test process and blocks TERM. > > The main process can print, that it ran into a timeout, send a sigterm > > to its processs group (while ignoring TERM itself). > > Then it can unset $_tst_setup_timer_pid safely, because it knows it was > > triggered by the timeout process and execute _tst_do_exit. > > If the timeout process does not see the termination of the main process, > > it can still send SIGKILL to the whole process group. > It probably will be work but looks a bit confusing since that involves > more signals. > In conclusion, I think we maybe have such situations to be solved: > 1. SIGINT (Ctrl^C) for terminating the main process and do cleanup > correctly before a timeout > 2. Test finish normally and retrieves the _tst_timeout_process in the > background via SIGTERM(sending by _tst_cleanup_timer) > 3. Test timeout occurs and _tst_kill_test sending SIGTERM to > terminating all process, and the main process do cleanup work > 4. Test timeout occurs but still have process alive after > _tst_kill_test sending SIGTERM, then sending SIGKILL to the whole > group > So, I'm now thinking can we just introduce a knob(variable) for skipping > the _tst_cleanup_timer works in timeout mode, then it will not have a > deadlock anymore. +1 > How about: I'm not sure if we're not getting too late for these changes. Because it'll be just on us to test that (community probably have run these changes). But I'd still prefer to fix it. If we don't fix it, I'd be at least for fixing wrong redirection order (2>&1 > /dev/null) introduced by me in 25ad54dba. > --- a/testcases/lib/tst_test.sh > +++ b/testcases/lib/tst_test.sh > @@ -16,12 +16,14 @@ export TST_COUNT=1 > export TST_ITERATIONS=1 > export TST_TMPDIR_RHOST=0 > export TST_LIB_LOADED=1 > +export TST_TIMEOUT_OCCUR=0 > . tst_ansi_color.sh > . tst_security.sh > # default trap function > -trap "tst_brk TBROK 'test interrupted or timed out'" INT > +trap "tst_brk TBROK 'test interrupted'" INT > +trap "TST_TIMEOUT_OCCUR=1; tst_brk TBROK 'test timeouted'" TERM > _tst_do_exit() > { > @@ -48,7 +50,9 @@ _tst_do_exit() > [ "$TST_TMPDIR_RHOST" = 1 ] && tst_cleanup_rhost > fi > - _tst_cleanup_timer > + if ["$TST_TIMEOUT_OCCUR" = 0 ]; then > + _tst_cleanup_timer > + fi > if [ $TST_FAIL -gt 0 ]; then > ret=$((ret|1)) > @@ -439,18 +443,18 @@ _tst_kill_test() > { > local i=10 > - trap '' INT > - tst_res TBROK "Test timeouted, sending SIGINT! If you are > running on slow machine, try exporting LTP_TIMEOUT_MUL > 1" > - kill -INT -$pid > + trap '' TERM > + tst_res TBROK "Test timeouted, sending SIGTERM! If you are > running on slow machine, try exporting LTP_TIMEOUT_MUL > 1" > + kill -TERM -$pid > tst_sleep 100ms > - while kill -0 $pid 2>&1 > /dev/null && [ $i -gt 0 ]; do > + while kill -0 $pid >/dev/null 2>&1 && [ $i -gt 0 ]; do > tst_res TINFO "Test is still running, waiting ${i}s" > sleep 1 > i=$((i-1)) > done > - if kill -0 $pid 2>&1 > /dev/null; then > + if kill -0 $pid >/dev/null 2>&1; then > tst_res TBROK "Test still running, sending SIGKILL" > kill -KILL -$pid > fi LGTM, I'll try to test it today. Kind regards, Petr
Hi, On 5/18/2021 9:27 AM, Li Wang wrote: > Hi Joerg, > >>> -trap "tst_brk TBROK 'test interrupted or timed out'" INT >>> +trap "tst_brk TBROK 'test interrupted'" INT >> This would require something like >> trap "tst_brk TBROK 'test terminated'" TERM >> or >> trap "_tst_do_exit" TERM >> >> Otherwise the test is terminated very roughly, without executing >> cleanup, which is probably not a good idea. > Yes, seems I didn't realize this needs cleanup as well. > > But I'd still suggest keeping SIGINT here for catching Ctrl^C for users :). +1, I never intended to remove thi >> I am currently thinking about the following solution, to mitigate most >> problems: >> The timeout process sends SIGUSR1 (or maybe SIGALRM?) only to the main >> test process and blocks TERM. >> The main process can print, that it ran into a timeout, send a sigterm >> to its processs group (while ignoring TERM itself). >> Then it can unset $_tst_setup_timer_pid safely, because it knows it was >> triggered by the timeout process and execute _tst_do_exit. >> >> If the timeout process does not see the termination of the main process, >> it can still send SIGKILL to the whole process group. > > It probably will be work but looks a bit confusing since that involves > more signals. > > In conclusion, I think we maybe have such situations to be solved: > > 1. SIGINT (Ctrl^C) for terminating the main process and do cleanup > correctly before a timeout > 2. Test finish normally and retrieves the _tst_timeout_process in the > background via SIGTERM(sending by _tst_cleanup_timer) > 3. Test timeout occurs and _tst_kill_test sending SIGTERM to > terminating all process, and the main process do cleanup work > 4. Test timeout occurs but still have process alive after > _tst_kill_test sending SIGTERM, then sending SIGKILL to the whole > group > > So, I'm now thinking can we just introduce a knob(variable) for skipping > the _tst_cleanup_timer works in timeout mode, then it will not have a > deadlock anymore. This works of course and is the "simplest" solution, the only thing I do not like about this, is the fact, that SIGTERM send by something else (e.g. system shoutdown or process manager), is handled like timeouts are handled and reported as timeout. That's why I suggested introducing a new signal. But since this is probably rare, I could live without it. > > How about: > > --- a/testcases/lib/tst_test.sh > +++ b/testcases/lib/tst_test.sh > @@ -16,12 +16,14 @@ export TST_COUNT=1 > export TST_ITERATIONS=1 > export TST_TMPDIR_RHOST=0 > export TST_LIB_LOADED=1 > +export TST_TIMEOUT_OCCUR=0 > > . tst_ansi_color.sh > . tst_security.sh > > # default trap function > -trap "tst_brk TBROK 'test interrupted or timed out'" INT > +trap "tst_brk TBROK 'test interrupted'" INT > +trap "TST_TIMEOUT_OCCUR=1; tst_brk TBROK 'test timeouted'" TERM This could also be done by "unset _tst_setup_timer_pid" or '_tst_setup_timer_pid=""'. I guess even if a new variable is introduced, it should start with an _, because it is supposed to be internal to the framework? > > _tst_do_exit() > { > @@ -48,7 +50,9 @@ _tst_do_exit() > [ "$TST_TMPDIR_RHOST" = 1 ] && tst_cleanup_rhost > fi > > - _tst_cleanup_timer > + if ["$TST_TIMEOUT_OCCUR" = 0 ]; then > + _tst_cleanup_timer > + fi > > if [ $TST_FAIL -gt 0 ]; then > ret=$((ret|1)) > @@ -439,18 +443,18 @@ _tst_kill_test() > { > local i=10 > > - trap '' INT > - tst_res TBROK "Test timeouted, sending SIGINT! If you are > running on slow machine, try exporting LTP_TIMEOUT_MUL > 1" > - kill -INT -$pid > + trap '' TERM > + tst_res TBROK "Test timeouted, sending SIGTERM! If you are > running on slow machine, try exporting LTP_TIMEOUT_MUL > 1" If you post this as a patch, can you please fix "timeouted" => "timed out"? There is no word "timeouted" in the english language. > + kill -TERM -$pid > tst_sleep 100ms > > - while kill -0 $pid 2>&1 > /dev/null && [ $i -gt 0 ]; do > + while kill -0 $pid >/dev/null 2>&1 && [ $i -gt 0 ]; do > tst_res TINFO "Test is still running, waiting ${i}s" > sleep 1 > i=$((i-1)) > done > > - if kill -0 $pid 2>&1 > /dev/null; then > + if kill -0 $pid >/dev/null 2>&1; then > tst_res TBROK "Test still running, sending SIGKILL" > kill -KILL -$pid > fi > > > -- > Regards, > Li Wang @Petr I wouldn't recommend getting the fix into the release. The problem is nothing new and does not fix a "real issue" at the moment, but has the risk of introducing something unexpected. Fixing the output redirection could be done without a major risk, I guess. Jörg
Hi all, > > In conclusion, I think we maybe have such situations to be solved: > > > > 1. SIGINT (Ctrl^C) for terminating the main process and do cleanup > > correctly before a timeout > > 2. Test finish normally and retrieves the _tst_timeout_process in the > > background via SIGTERM(sending by _tst_cleanup_timer) > > 3. Test timeout occurs and _tst_kill_test sending SIGTERM to > > terminating all process, and the main process do cleanup work > > 4. Test timeout occurs but still have process alive after > > _tst_kill_test sending SIGTERM, then sending SIGKILL to the whole > > group > > > > So, I'm now thinking can we just introduce a knob(variable) for skipping > > the _tst_cleanup_timer works in timeout mode, then it will not have a > > deadlock anymore. > This works of course and is the "simplest" solution, the only thing I do > not like about this, > is the fact, that SIGTERM send by something else (e.g. system shoutdown > or process manager), > is handled like timeouts are handled and reported as timeout. That's why > I suggested introducing > a new signal. But since this is probably rare, I could live without it. Hmm, it wouldn't handle/report like a time-out if we break with "test terminated" output for a SIGTERM. If we do trap "unset _tst_setup_timer_pid; tst_brk TBROK 'test terminated'" TERM in the main process, system will still send SIGTERM to the _tst_timeout_process when shutting down, and the _tst_kill_test will never be called in that case. > > > > > > How about: > > > > --- a/testcases/lib/tst_test.sh > > +++ b/testcases/lib/tst_test.sh > > @@ -16,12 +16,14 @@ export TST_COUNT=1 > > export TST_ITERATIONS=1 > > export TST_TMPDIR_RHOST=0 > > export TST_LIB_LOADED=1 > > +export TST_TIMEOUT_OCCUR=0 > > > > . tst_ansi_color.sh > > . tst_security.sh > > > > # default trap function > > -trap "tst_brk TBROK 'test interrupted or timed out'" INT > > +trap "tst_brk TBROK 'test interrupted'" INT > > +trap "TST_TIMEOUT_OCCUR=1; tst_brk TBROK 'test timeouted'" TERM > This could also be done by "unset _tst_setup_timer_pid" or > '_tst_setup_timer_pid=""'. +1, 'unset _tst_setup_timer_pid' is a good idea, sorry I was blind here when reading your previous email:). > I guess even if a new variable is introduced, it should start with an _, > because it is supposed to be internal to the framework? Yes, but let's go with "unset _tst_setup_timer_pid" but not introduce a new variable. > > > > > > _tst_do_exit() > > { > > @@ -48,7 +50,9 @@ _tst_do_exit() > > [ "$TST_TMPDIR_RHOST" = 1 ] && tst_cleanup_rhost > > fi > > > > - _tst_cleanup_timer > > + if ["$TST_TIMEOUT_OCCUR" = 0 ]; then > > + _tst_cleanup_timer > > + fi > > > > if [ $TST_FAIL -gt 0 ]; then > > ret=$((ret|1)) > > @@ -439,18 +443,18 @@ _tst_kill_test() > > { > > local i=10 > > > > - trap '' INT > > - tst_res TBROK "Test timeouted, sending SIGINT! If you are > > running on slow machine, try exporting LTP_TIMEOUT_MUL > 1" > > - kill -INT -$pid > > + trap '' TERM > > + tst_res TBROK "Test timeouted, sending SIGTERM! If you are > > running on slow machine, try exporting LTP_TIMEOUT_MUL > 1" > If you post this as a patch, can you please fix "timeouted" => "timed out"? > There is no word "timeouted" in the english language. Sure. Thanks for your strict attitude on syntax. > @Petr > I wouldn't recommend getting the fix into the release. > The problem is nothing new and does not fix a "real issue" at the moment, > but has the risk of introducing something unexpected. > Fixing the output redirection could be done without a major risk, I guess. I will split the fix into two-part, one for errors redirection, another for SIGTERM using. Thanks for your review! -- Regards, Li Wang
diff --git a/testcases/lib/tst_test.sh b/testcases/lib/tst_test.sh index ed2699175..b6ca0cb26 100644 --- a/testcases/lib/tst_test.sh +++ b/testcases/lib/tst_test.sh @@ -439,6 +439,7 @@ _tst_kill_test() { local i=10 + trap '' INT tst_res TBROK "Test timeouted, sending SIGINT! If you are running on slow machine, try exporting LTP_TIMEOUT_MUL > 1" kill -INT -$pid tst_sleep 100ms