Message ID | 20210519085812.27263-1-liwang@redhat.com |
---|---|
State | Accepted |
Headers | show |
Series | [v2] tst_test: using SIGTERM to terminate process | expand |
Hi, On 5/19/2021 10:58 AM, Li Wang wrote: > diff --git a/lib/newlib_tests/shell/timeout03.sh b/lib/newlib_tests/shell/timeout03.sh > index cd548d9a2..dabef32a2 100755 > --- a/lib/newlib_tests/shell/timeout03.sh > +++ b/lib/newlib_tests/shell/timeout03.sh > @@ -6,8 +6,9 @@ > # expected output: > # timeout03 1 TINFO: timeout per run is 0h 0m 1s > # timeout03 1 TINFO: testing killing test after TST_TIMEOUT > -# timeout03 1 TBROK: Test timeouted, sending SIGINT! If you are running on slow machine, try exporting LTP_TIMEOUT_MUL > 1 > -# timeout03 1 TBROK: test interrupted or timed out > +# timeout03 1 TBROK: Test timed out, sending SIGTERM! If you are running on slow machine, try exporting LTP_TIMEOUT_MUL > 1 > +# Terminated Nit: This Message seems to be only there for bash... I tried to find why and if it can be suppressed, but found nothing. I guess it comes from the terminated sleep process. Jörg
On Wed, May 19, 2021 at 5:19 PM Joerg Vehlow <lkml@jv-coder.de> wrote: > > Hi, > > On 5/19/2021 10:58 AM, Li Wang wrote: > > diff --git a/lib/newlib_tests/shell/timeout03.sh b/lib/newlib_tests/shell/timeout03.sh > > index cd548d9a2..dabef32a2 100755 > > --- a/lib/newlib_tests/shell/timeout03.sh > > +++ b/lib/newlib_tests/shell/timeout03.sh > > @@ -6,8 +6,9 @@ > > # expected output: > > # timeout03 1 TINFO: timeout per run is 0h 0m 1s > > # timeout03 1 TINFO: testing killing test after TST_TIMEOUT > > -# timeout03 1 TBROK: Test timeouted, sending SIGINT! If you are running on slow machine, try exporting LTP_TIMEOUT_MUL > 1 > > -# timeout03 1 TBROK: test interrupted or timed out > > +# timeout03 1 TBROK: Test timed out, sending SIGTERM! If you are running on slow machine, try exporting LTP_TIMEOUT_MUL > 1 > > +# Terminated > Nit: This Message seems to be only there for bash... I tried to find why > and if it can be suppressed, but found nothing. I guess it comes from > the terminated sleep process. Yes, but that's not a big deal, it only appears in timed out and we can tolerate:).
Hi Cyril, Petr, Any thoughts on merging this one?
Hi Li, all, > We'd better avoid using SIGINT for process terminating becuasue, > it has different behavoir on kind of shell. > From Joerg Vehlow's test: > - bash does not seem to care about SIGINT delivery to background > processes, but can be blocked using trap > - zsh ignores SIGINT for background processes by default, but can be > allowed using trap > - dash and busybox sh ignore the signal to background processes, and > this cannot be changed with trap > This patch cover the below situations: > 1. SIGINT (Ctrl^C) for terminating the main process and do cleanup > correctly before a timeout > 2. Test finish normally and retrieves the _tst_timeout_process in the > background via SIGTERM(sending by _tst_cleanup_timer) > 3. Test timed out occurs and _tst_kill_test sending SIGTERM to > terminating all process, and the main process do cleanup work > 4. Test timed out occurs but still have process alive after _tst_kill_test > sending SIGTERM, then sending SIGKILL to the whole group > 5. Test terminated by SIGTERM unexpectly (e.g. system shutdown or process > manager) and do cleanup work as well > Co-authored-by: Joerg Vehlow <joerg.vehlow@aox-tech.de> > Signed-off-by: Li Wang <liwang@redhat.com> > Reviewed-by: Joerg Vehlow <joerg.vehlow@aox-tech.de> ... > +++ b/testcases/lib/tst_test.sh > @@ -21,7 +21,8 @@ export TST_LIB_LOADED=1 > . tst_security.sh > # default trap function > -trap "tst_brk TBROK 'test interrupted or timed out'" INT > +trap "tst_brk TBROK 'test interrupted'" INT > +trap "unset _tst_setup_timer_pid; tst_brk TBROK 'test terminated'" TERM FYI this commit (merged as 4a6b8a697 ("tst_test: using SIGTERM to terminate process")) broke net_stress_interface tests, particularly tst_require_cmds() call (which calls tst_brk TCONF: # ./if-addr-adddel.sh -c ifconfig if-addr-adddel 1 TINFO: initialize 'lhost' 'ltp_ns_veth2' interface if-addr-adddel 1 TINFO: add local addr 10.0.0.2/24 if-addr-adddel 1 TINFO: add local addr fd00:1:1:1::2/64 if-addr-adddel 1 TINFO: initialize 'rhost' 'ltp_ns_veth1' interface if-addr-adddel 1 TINFO: add remote addr 10.0.0.1/24 if-addr-adddel 1 TINFO: add remote addr fd00:1:1:1::1/64 if-addr-adddel 1 TINFO: Network config (local -- remote): if-addr-adddel 1 TINFO: ltp_ns_veth2 -- ltp_ns_veth1 if-addr-adddel 1 TINFO: 10.0.0.2/24 -- 10.0.0.1/24 if-addr-adddel 1 TINFO: fd00:1:1:1::2/64 -- fd00:1:1:1::1/64 if-addr-adddel 1 TINFO: timeout per run is 0h 5m 0s if-addr-adddel 1 TCONF: 'ifconfig' not found => waits till timeout if-addr-adddel 1 TBROK: Test timed out, sending SIGTERM! If you are running on slow machine, try exporting LTP_TIMEOUT_MUL > 1 if-addr-adddel 1 TWARN: test terminated Debugging it hangs in wait in _tst_cleanup_timer(): kill -TERM $_tst_setup_timer_pid 2>/dev/null wait $_tst_setup_timer_pid 2>/dev/null because kill does not kill the test. The problem looks to be that unset actually does not work. trap "unset _tst_setup_timer_pid; tst_brk TBROK 'test terminated'" TERM It looks to be something setup specific, because I discovered this on SLES on both bash and dash. Running it on current Debian testing it works on both bash and dash. I checked shopt output on both, but don't see anything obvious. It must be something else. Kind regards, Petr > _tst_do_exit() > { > @@ -439,9 +440,9 @@ _tst_kill_test() > { > local i=10 > - trap '' INT > - tst_res TBROK "Test timeouted, sending SIGINT! If you are running on slow machine, try exporting LTP_TIMEOUT_MUL > 1" > - kill -INT -$pid > + trap '' TERM > + tst_res TBROK "Test timed out, sending SIGTERM! If you are running on slow machine, try exporting LTP_TIMEOUT_MUL > 1" > + kill -TERM -$pid > tst_sleep 100ms > while kill -0 $pid >/dev/null 2>&1 && [ $i -gt 0 ]; do
Hi Li, all, > Hi Li, all, [ Cc Cyril and Alexey ] > > We'd better avoid using SIGINT for process terminating becuasue, > > it has different behavoir on kind of shell. > > From Joerg Vehlow's test: > > - bash does not seem to care about SIGINT delivery to background > > processes, but can be blocked using trap > > - zsh ignores SIGINT for background processes by default, but can be > > allowed using trap > > - dash and busybox sh ignore the signal to background processes, and > > this cannot be changed with trap > > This patch cover the below situations: > > 1. SIGINT (Ctrl^C) for terminating the main process and do cleanup > > correctly before a timeout > > 2. Test finish normally and retrieves the _tst_timeout_process in the > > background via SIGTERM(sending by _tst_cleanup_timer) > > 3. Test timed out occurs and _tst_kill_test sending SIGTERM to > > terminating all process, and the main process do cleanup work > > 4. Test timed out occurs but still have process alive after _tst_kill_test > > sending SIGTERM, then sending SIGKILL to the whole group > > 5. Test terminated by SIGTERM unexpectly (e.g. system shutdown or process > > manager) and do cleanup work as well > > Co-authored-by: Joerg Vehlow <joerg.vehlow@aox-tech.de> > > Signed-off-by: Li Wang <liwang@redhat.com> > > Reviewed-by: Joerg Vehlow <joerg.vehlow@aox-tech.de> > ... > > +++ b/testcases/lib/tst_test.sh > > @@ -21,7 +21,8 @@ export TST_LIB_LOADED=1 > > . tst_security.sh > > # default trap function > > -trap "tst_brk TBROK 'test interrupted or timed out'" INT > > +trap "tst_brk TBROK 'test interrupted'" INT > > +trap "unset _tst_setup_timer_pid; tst_brk TBROK 'test terminated'" TERM > FYI this commit (merged as 4a6b8a697 ("tst_test: using SIGTERM to terminate process")) > broke net_stress_interface tests, particularly tst_require_cmds() call (which > calls tst_brk TCONF: > # ./if-addr-adddel.sh -c ifconfig > if-addr-adddel 1 TINFO: initialize 'lhost' 'ltp_ns_veth2' interface > if-addr-adddel 1 TINFO: add local addr 10.0.0.2/24 > if-addr-adddel 1 TINFO: add local addr fd00:1:1:1::2/64 > if-addr-adddel 1 TINFO: initialize 'rhost' 'ltp_ns_veth1' interface > if-addr-adddel 1 TINFO: add remote addr 10.0.0.1/24 > if-addr-adddel 1 TINFO: add remote addr fd00:1:1:1::1/64 > if-addr-adddel 1 TINFO: Network config (local -- remote): > if-addr-adddel 1 TINFO: ltp_ns_veth2 -- ltp_ns_veth1 > if-addr-adddel 1 TINFO: 10.0.0.2/24 -- 10.0.0.1/24 > if-addr-adddel 1 TINFO: fd00:1:1:1::2/64 -- fd00:1:1:1::1/64 > if-addr-adddel 1 TINFO: timeout per run is 0h 5m 0s > if-addr-adddel 1 TCONF: 'ifconfig' not found > => waits till timeout > if-addr-adddel 1 TBROK: Test timed out, sending SIGTERM! If you are running on slow machine, try exporting LTP_TIMEOUT_MUL > 1 > if-addr-adddel 1 TWARN: test terminated > Debugging it hangs in wait in _tst_cleanup_timer(): > kill -TERM $_tst_setup_timer_pid 2>/dev/null > wait $_tst_setup_timer_pid 2>/dev/null > because kill does not kill the test. > The problem looks to be that unset actually does not work. > trap "unset _tst_setup_timer_pid; tst_brk TBROK 'test terminated'" TERM > It looks to be something setup specific, because I discovered this on SLES on > both bash and dash. Running it on current Debian testing it works on both bash > and dash. I checked shopt output on both, but don't see anything obvious. It > must be something else. OK, repeatedly running on Debian with dash I managed to get hang as well: Here it does not even quit the test: if-addr-adddel 1 TCONF: 'ifconfig' not found if-addr-adddel 1 TBROK: Test timed out, sending SIGTERM! If you are running on slow machine, try exporting LTP_TIMEOUT_MUL > 1 if-addr-adddel 1 TBROK: Test timed out, sending SIGTERM! If you are running on slow machine, try exporting LTP_TIMEOUT_MUL > 1 if-addr-adddel 1 TBROK: Test timed out, sending SIGTERM! If you are running on slow machine, try exporting LTP_TIMEOUT_MUL > 1 if-addr-adddel 1 TBROK: Test timed out, sending SIGTERM! If you are running on slow machine, try exporting LTP_TIMEOUT_MUL > 1 if-addr-adddel 1 TBROK: Test timed out, sending SIGTERM! If you are running on slow machine, try exporting LTP_TIMEOUT_MUL > 1 if-addr-adddel 1 TWARN: test terminated Maybe not only SIGINT, but even SIGTERM is not reliable to background process? Minimal reproducible example, on Dash needs few runs to hang: cat > debug.sh <<EOF #!/bin/sh TST_SETUP="setup" TST_TESTFUNC="do_test" . tst_test.sh setup() { tst_brk TCONF "quit now!" } do_test() { tst_res TPASS "pass :)" } tst_run EOF # while true; do ./debug.sh; done Kind regards, Petr > Kind regards, > Petr > > _tst_do_exit() > > { > > @@ -439,9 +440,9 @@ _tst_kill_test() > > { > > local i=10 > > - trap '' INT > > - tst_res TBROK "Test timeouted, sending SIGINT! If you are running on slow machine, try exporting LTP_TIMEOUT_MUL > 1" > > - kill -INT -$pid > > + trap '' TERM > > + tst_res TBROK "Test timed out, sending SIGTERM! If you are running on slow machine, try exporting LTP_TIMEOUT_MUL > 1" > > + kill -TERM -$pid > > tst_sleep 100ms > > while kill -0 $pid >/dev/null 2>&1 && [ $i -gt 0 ]; do
Hi! > setup() > { > tst_brk TCONF "quit now!" > } > > do_test() > { > tst_res TPASS "pass :)" > } > > tst_run > EOF > > # while true; do ./debug.sh; done I managed to reproduce this in dash. I bet that this is a bug where signal handler inside dash is temporarily disabled when we install the trap and if we manage to hit that window the signal is discarded. At least that is my working theory. After I've installed debug prints, in the cases where it hangs the signal was sent just before have installed the trap. And in some cases when the signal arrives the timer process is killed but the trap is not invoked. So it really looks like signal handling in dash is simply broken. Not sure what we can do about bugs like this apart from switching to a real programming language.
Hi Cyril, > Hi! > > setup() > > { > > tst_brk TCONF "quit now!" > > } > > do_test() > > { > > tst_res TPASS "pass :)" > > } > > tst_run > > EOF > > # while true; do ./debug.sh; done > I managed to reproduce this in dash. I bet that this is a bug where > signal handler inside dash is temporarily disabled when we install the > trap and if we manage to hit that window the signal is discarded. At > least that is my working theory. After I've installed debug prints, in > the cases where it hangs the signal was sent just before have installed > the trap. And in some cases when the signal arrives the timer process is > killed but the trap is not invoked. So it really looks like signal > handling in dash is simply broken. Not sure what we can do about bugs > like this apart from switching to a real programming language. Thanks for the debugging. *bash* is also affected, at least some releases. I reproduced it also on some older SLES, with bash 4.4. Kind regards, Petr
> Hi Cyril, > > Hi! > > > setup() > > > { > > > tst_brk TCONF "quit now!" > > > } > > > do_test() > > > { > > > tst_res TPASS "pass :)" > > > } > > > tst_run > > > EOF > > > # while true; do ./debug.sh; done > > I managed to reproduce this in dash. I bet that this is a bug where > > signal handler inside dash is temporarily disabled when we install the > > trap and if we manage to hit that window the signal is discarded. At > > least that is my working theory. After I've installed debug prints, in > > the cases where it hangs the signal was sent just before have installed > > the trap. And in some cases when the signal arrives the timer process is > > killed but the trap is not invoked. So it really looks like signal > > handling in dash is simply broken. Not sure what we can do about bugs > > like this apart from switching to a real programming language. Which version of bash and dash are you testing on? > Thanks for the debugging. *bash* is also affected, at least some releases. > I reproduced it also on some older SLES, with bash 4.4. dash 0.5.11.4 and 5.1.8 on my Tumbleweed laptop are OK. I tested it on various my VM: dash *failing*: 0.5.8 (SLES), 0.5.11.3 (Tumbleweed), 0.5.11+git20200708+dd9ef66-5 (Debian), 0.5.7-4+b1 (Debian) dash *OK*: 0.5.11.2 (SLES 15), 0.5.10.2 (CentOS) bash *failing*: 5.1.4-1.4 (Tumbleweed), 4.4-9.7.1 (SLES 15) bash *ok*: 4.4-17 (SLES 15), 4.3-83 (SLES 12), 4.3-11+deb8u (Debian), 5.1-2+b3 (Debian), 4.2.46-34 (CentOS) I have no idea what it causes, whether really some bash and dash versions are buggy or it's reproducible only on certain environment. Any tip what to search for? Kind regards, Petr > Kind regards, > Petr
Hi! > > > I managed to reproduce this in dash. I bet that this is a bug where > > > signal handler inside dash is temporarily disabled when we install the > > > trap and if we manage to hit that window the signal is discarded. At > > > least that is my working theory. After I've installed debug prints, in > > > the cases where it hangs the signal was sent just before have installed > > > the trap. And in some cases when the signal arrives the timer process is > > > killed but the trap is not invoked. So it really looks like signal > > > handling in dash is simply broken. Not sure what we can do about bugs > > > like this apart from switching to a real programming language. > Which version of bash and dash are you testing on? > > > Thanks for the debugging. *bash* is also affected, at least some releases. > > I reproduced it also on some older SLES, with bash 4.4. > dash 0.5.11.4 and 5.1.8 on my Tumbleweed laptop are OK. > > I tested it on various my VM: > > dash *failing*: 0.5.8 (SLES), 0.5.11.3 (Tumbleweed), 0.5.11+git20200708+dd9ef66-5 (Debian), 0.5.7-4+b1 (Debian) > dash *OK*: 0.5.11.2 (SLES 15), 0.5.10.2 (CentOS) > > bash *failing*: 5.1.4-1.4 (Tumbleweed), 4.4-9.7.1 (SLES 15) > bash *ok*: 4.4-17 (SLES 15), 4.3-83 (SLES 12), 4.3-11+deb8u (Debian), 5.1-2+b3 > (Debian), 4.2.46-34 (CentOS) > > I have no idea what it causes, whether really some bash and dash versions are > buggy or it's reproducible only on certain environment. bash 5.1.8 seems to work okay, dash 0.5.11.3-r2 seems to fail here. > Any tip what to search for? Not really, apart from reading source code and figuring out exactly what happens here. I guess that we can make things more predictable and easier to read by shifting parts of the shell library to a C process. For instance if we wrote an utility that would implement all the tst_kill_test and timeout process in C we would simplify things greatly. Something as: tst_timeout_kill that would be used as: tst_timeout_kill 300 12342 ^ ^ | process group leader pid timeout in seconds That would implement both the loop for killing tests and timeout processing as well. That way we would get rid of the trap in the subshell and we would end up with a single pid for the whole timeout process and avoid the recursive sigkill to begin with.
> Hi! > > > > I managed to reproduce this in dash. I bet that this is a bug where > > > > signal handler inside dash is temporarily disabled when we install the > > > > trap and if we manage to hit that window the signal is discarded. At > > > > least that is my working theory. After I've installed debug prints, in > > > > the cases where it hangs the signal was sent just before have installed > > > > the trap. And in some cases when the signal arrives the timer process is > > > > killed but the trap is not invoked. So it really looks like signal > > > > handling in dash is simply broken. Not sure what we can do about bugs > > > > like this apart from switching to a real programming language. > > Which version of bash and dash are you testing on? > > > Thanks for the debugging. *bash* is also affected, at least some releases. > > > I reproduced it also on some older SLES, with bash 4.4. > > dash 0.5.11.4 and 5.1.8 on my Tumbleweed laptop are OK. > > I tested it on various my VM: > > dash *failing*: 0.5.8 (SLES), 0.5.11.3 (Tumbleweed), 0.5.11+git20200708+dd9ef66-5 (Debian), 0.5.7-4+b1 (Debian) > > dash *OK*: 0.5.11.2 (SLES 15), 0.5.10.2 (CentOS) > > bash *failing*: 5.1.4-1.4 (Tumbleweed), 4.4-9.7.1 (SLES 15) > > bash *ok*: 4.4-17 (SLES 15), 4.3-83 (SLES 12), 4.3-11+deb8u (Debian), 5.1-2+b3 > > (Debian), 4.2.46-34 (CentOS) > > I have no idea what it causes, whether really some bash and dash versions are > > buggy or it's reproducible only on certain environment. > bash 5.1.8 seems to work okay, dash 0.5.11.3-r2 seems to fail here. > > Any tip what to search for? > Not really, apart from reading source code and figuring out exactly what > happens here. > I guess that we can make things more predictable and easier to read by > shifting parts of the shell library to a C process. > For instance if we wrote an utility that would implement all the > tst_kill_test and timeout process in C we would simplify things greatly. > Something as: > tst_timeout_kill > that would be used as: > tst_timeout_kill 300 12342 > ^ ^ > | process group leader pid > timeout in seconds > That would implement both the loop for killing tests and timeout > processing as well. That way we would get rid of the trap in the > subshell and we would end up with a single pid for the whole timeout > process and avoid the recursive sigkill to begin with. +1 for this idea instead of never ending story to fix various shells. Please if you have time, wrote that. Kind regards, Petr
diff --git a/lib/newlib_tests/shell/test_timeout.sh b/lib/newlib_tests/shell/test_timeout.sh index b05680cb1..c70e22f27 100755 --- a/lib/newlib_tests/shell/test_timeout.sh +++ b/lib/newlib_tests/shell/test_timeout.sh @@ -28,7 +28,7 @@ timeout02.sh| -10|0| |2 timeout02.sh| -0.1|0| |0 timeout02.sh| -1.1|0| |2 timeout02.sh|-10.1|0| |2 -timeout03.sh| |0|12|137| | | |Test kill if test does not terminate by SIGINT +timeout03.sh| |0|12|137| | | |Test is killed, if it does not terminate after SIGTERM timeout04.sh| |0| | 2|0|0|1|Verify that timeout is enforced timeout02.sh| 2|1| 2| |1|0|0|Test termination of timeout process " diff --git a/lib/newlib_tests/shell/timeout03.sh b/lib/newlib_tests/shell/timeout03.sh index cd548d9a2..dabef32a2 100755 --- a/lib/newlib_tests/shell/timeout03.sh +++ b/lib/newlib_tests/shell/timeout03.sh @@ -6,8 +6,9 @@ # expected output: # timeout03 1 TINFO: timeout per run is 0h 0m 1s # timeout03 1 TINFO: testing killing test after TST_TIMEOUT -# timeout03 1 TBROK: Test timeouted, sending SIGINT! If you are running on slow machine, try exporting LTP_TIMEOUT_MUL > 1 -# timeout03 1 TBROK: test interrupted or timed out +# timeout03 1 TBROK: Test timed out, sending SIGTERM! If you are running on slow machine, try exporting LTP_TIMEOUT_MUL > 1 +# Terminated +# timeout03 1 TBROK: test terminated # timeout03 1 TPASS: test run cleanup after timeout # timeout03 1 TINFO: Test is still running, waiting 10s # timeout03 1 TINFO: Test is still running, waiting 9s diff --git a/testcases/lib/tst_test.sh b/testcases/lib/tst_test.sh index 3a5651c01..66ffde4eb 100644 --- a/testcases/lib/tst_test.sh +++ b/testcases/lib/tst_test.sh @@ -21,7 +21,8 @@ export TST_LIB_LOADED=1 . tst_security.sh # default trap function -trap "tst_brk TBROK 'test interrupted or timed out'" INT +trap "tst_brk TBROK 'test interrupted'" INT +trap "unset _tst_setup_timer_pid; tst_brk TBROK 'test terminated'" TERM _tst_do_exit() { @@ -439,9 +440,9 @@ _tst_kill_test() { local i=10 - trap '' INT - tst_res TBROK "Test timeouted, sending SIGINT! If you are running on slow machine, try exporting LTP_TIMEOUT_MUL > 1" - kill -INT -$pid + trap '' TERM + tst_res TBROK "Test timed out, sending SIGTERM! If you are running on slow machine, try exporting LTP_TIMEOUT_MUL > 1" + kill -TERM -$pid tst_sleep 100ms while kill -0 $pid >/dev/null 2>&1 && [ $i -gt 0 ]; do