Message ID | 20220213042836.3028266-1-liwang@redhat.com |
---|---|
State | Superseded |
Headers | show |
Series | test_children_cleanup: allow child zombied for a while | expand |
Hi Li, > Zombie process is already dead after SIGKILL is processed by the kernel, > there's usually a kernel reason (similar to being "blocked" waiting on > a syscall to finish) for the process not terminating. > Due to this child having been moved under PID 1(init), there is no waitpid() > guarantee of reaping it anymore. It completely depends on PID 1(init) reclaims > zombie process at its own pace. > So here allow the child exist in a zombie state if PID 1(init) does not > reclaim resource and update the process table instantly. > Failed CI: > https://github.com/linux-test-project/ltp/runs/5171298664?check_suite_focus=true Good catch! ... > runtest TINFO: * test_children_cleanup.sh > TFAIL: Child process was left behind > cat /proc/69523/status > Name: test_children_c > State: Z (zombie) > Tgid: 69523 > Ngid: 0 > Pid: 69523 > PPid: 1 ... > +++ b/lib/newlib_tests/test_children_cleanup.sh > @@ -13,7 +13,13 @@ if [ "x$CHILD_PID" = "x" ]; then > elif ! kill -s 0 $CHILD_PID &>/dev/null; then > echo "TPASS: Child process was cleaned up" > exit 0 > +elif grep -q -E "Z|zombie" /proc/$CHILD_PID/status; then task_state_array[] in fs/proc/array.c has "Z (zombie)" Is there a reason to grep just "Z"? Because -E "Z|zombie" matches just "Z" and when we don't grep for "State:" we can likely search for different result. "Z (zombie)" is here form kernel git beginning (2.6.12-rc2), we should match it. Kind regards, Petr
On 13. 02. 22 5:28, Li Wang wrote: > diff --git a/lib/newlib_tests/test_children_cleanup.sh b/lib/newlib_tests/test_children_cleanup.sh > index 4b4e8b2f0..ec1a0d4fe 100755 > --- a/lib/newlib_tests/test_children_cleanup.sh > +++ b/lib/newlib_tests/test_children_cleanup.sh > @@ -13,7 +13,13 @@ if [ "x$CHILD_PID" = "x" ]; then > elif ! kill -s 0 $CHILD_PID &>/dev/null; then > echo "TPASS: Child process was cleaned up" > exit 0 > +elif grep -q -E "Z|zombie" /proc/$CHILD_PID/status; then > + echo "TPASS: Child process was in zombie state" > + exit 0 We're in a race condition here either way so reading the status procfile after checking whether the process still exists can result in failure even when the child was properly killed. I wrongly believed that `kill -s 0` would fail when the target process is a zombie because the manpage vaguely suggests that (see the description of ESRCH errno in the kill(2) manpage) but it turns out I was wrong again. I'll send a fix myself later today.
On Mon, Feb 14, 2022 at 6:40 PM Martin Doucha <mdoucha@suse.cz> wrote: > On 13. 02. 22 5:28, Li Wang wrote: > > diff --git a/lib/newlib_tests/test_children_cleanup.sh > b/lib/newlib_tests/test_children_cleanup.sh > > index 4b4e8b2f0..ec1a0d4fe 100755 > > --- a/lib/newlib_tests/test_children_cleanup.sh > > +++ b/lib/newlib_tests/test_children_cleanup.sh > > @@ -13,7 +13,13 @@ if [ "x$CHILD_PID" = "x" ]; then > > elif ! kill -s 0 $CHILD_PID &>/dev/null; then > > echo "TPASS: Child process was cleaned up" > > exit 0 > > +elif grep -q -E "Z|zombie" /proc/$CHILD_PID/status; then > > + echo "TPASS: Child process was in zombie state" > > + exit 0 > > We're in a race condition here either way so reading the status procfile > after checking whether the process still exists can result in failure > even when the child was properly killed. I wrongly believed that > Ah yes. It is still possible to complete the zombie reclaim just before doing the grep. > `kill -s 0` would fail when the target process is a zombie because the > manpage vaguely suggests that (see the description of ESRCH errno in the > kill(2) manpage) but it turns out I was wrong again. > > I'll send a fix myself later today. > Thanks, I look forward to a better solution.
> > +elif grep -q -E "Z|zombie" /proc/$CHILD_PID/status; then > > task_state_array[] in fs/proc/array.c has "Z (zombie)" > Is there a reason to grep just "Z"? Because -E "Z|zombie" matches just "Z" > and > when we don't grep for "State:" we can likely search for different result. > "Z (zombie)" is here form kernel git beginning (2.6.12-rc2), we should > match it. > Thanks for pointing out this, but it seems this fix-way will be superseded.
diff --git a/lib/newlib_tests/test_children_cleanup.sh b/lib/newlib_tests/test_children_cleanup.sh index 4b4e8b2f0..ec1a0d4fe 100755 --- a/lib/newlib_tests/test_children_cleanup.sh +++ b/lib/newlib_tests/test_children_cleanup.sh @@ -13,7 +13,13 @@ if [ "x$CHILD_PID" = "x" ]; then elif ! kill -s 0 $CHILD_PID &>/dev/null; then echo "TPASS: Child process was cleaned up" exit 0 +elif grep -q -E "Z|zombie" /proc/$CHILD_PID/status; then + echo "TPASS: Child process was in zombie state" + exit 0 else echo "TFAIL: Child process was left behind" + echo "cat /proc/$CHILD_PID/status" + echo "---------------------------" + cat /proc/$CHILD_PID/status exit 1 fi
Zombie process is already dead after SIGKILL is processed by the kernel, there's usually a kernel reason (similar to being "blocked" waiting on a syscall to finish) for the process not terminating. Due to this child having been moved under PID 1(init), there is no waitpid() guarantee of reaping it anymore. It completely depends on PID 1(init) reclaims zombie process at its own pace. So here allow the child exist in a zombie state if PID 1(init) does not reclaim resource and update the process table instantly. Failed CI: https://github.com/linux-test-project/ltp/runs/5171298664?check_suite_focus=true ------------------- runtest TINFO: * test_children_cleanup tst_test.c:1455: TINFO: Timeout per run is 0h 00m 10s test_children_cleanup.c:20: TINFO: Main process 69516 starting test_children_cleanup.c:35: TINFO: Forked child 69518 tst_test.c:1500: TINFO: Killed the leftover descendant processes tst_test.c:1506: TINFO: If you are running on slow machine, try exporting LTP_TIMEOUT_MUL > 1 tst_test.c:1508: TBROK: Test killed! (timeout?) ------------------- runtest TINFO: * test_children_cleanup.sh TFAIL: Child process was left behind cat /proc/69523/status Name: test_children_c State: Z (zombie) Tgid: 69523 Ngid: 0 Pid: 69523 PPid: 1 ... New test CI: https://github.com/wangli5665/ltp/runs/5171508466?check_suite_focus=true Also, add some debug codes if test fails. Signed-off-by: Li Wang <liwang@redhat.com> Cc: Martin Doucha <mdoucha@suse.cz> --- lib/newlib_tests/test_children_cleanup.sh | 6 ++++++ 1 file changed, 6 insertions(+)