diff mbox series

test_children_cleanup: allow child zombied for a while

Message ID 20220213042836.3028266-1-liwang@redhat.com
State Superseded
Headers show
Series test_children_cleanup: allow child zombied for a while | expand

Commit Message

Li Wang Feb. 13, 2022, 4:28 a.m. UTC
Zombie process is already dead after SIGKILL is processed by the kernel,
there's usually a kernel reason (similar to being "blocked" waiting on
a syscall to finish) for the process not terminating.

Due to this child having been moved under PID 1(init), there is no waitpid()
guarantee of reaping it anymore. It completely depends on PID 1(init) reclaims
zombie process at its own pace.

So here allow the child exist in a zombie state if PID 1(init) does not
reclaim resource and update the process table instantly.

Failed CI:
  https://github.com/linux-test-project/ltp/runs/5171298664?check_suite_focus=true

  -------------------
  runtest TINFO: * test_children_cleanup
  tst_test.c:1455: TINFO: Timeout per run is 0h 00m 10s
  test_children_cleanup.c:20: TINFO: Main process 69516 starting
  test_children_cleanup.c:35: TINFO: Forked child 69518
  tst_test.c:1500: TINFO: Killed the leftover descendant processes
  tst_test.c:1506: TINFO: If you are running on slow machine, try exporting LTP_TIMEOUT_MUL > 1
  tst_test.c:1508: TBROK: Test killed! (timeout?)

  -------------------
  runtest TINFO: * test_children_cleanup.sh
  TFAIL: Child process was left behind
  cat /proc/69523/status
  Name:	test_children_c
  State:	Z (zombie)
  Tgid:	69523
  Ngid:	0
  Pid:	69523
  PPid:	1
  ...

New test CI:
  https://github.com/wangli5665/ltp/runs/5171508466?check_suite_focus=true

Also, add some debug codes if test fails.

Signed-off-by: Li Wang <liwang@redhat.com>
Cc: Martin Doucha <mdoucha@suse.cz>
---
 lib/newlib_tests/test_children_cleanup.sh | 6 ++++++
 1 file changed, 6 insertions(+)

Comments

Petr Vorel Feb. 13, 2022, 4:57 p.m. UTC | #1
Hi Li,

> Zombie process is already dead after SIGKILL is processed by the kernel,
> there's usually a kernel reason (similar to being "blocked" waiting on
> a syscall to finish) for the process not terminating.

> Due to this child having been moved under PID 1(init), there is no waitpid()
> guarantee of reaping it anymore. It completely depends on PID 1(init) reclaims
> zombie process at its own pace.

> So here allow the child exist in a zombie state if PID 1(init) does not
> reclaim resource and update the process table instantly.

> Failed CI:
>   https://github.com/linux-test-project/ltp/runs/5171298664?check_suite_focus=true

Good catch!

...
>   runtest TINFO: * test_children_cleanup.sh
>   TFAIL: Child process was left behind
>   cat /proc/69523/status
>   Name:	test_children_c
>   State:	Z (zombie)
>   Tgid:	69523
>   Ngid:	0
>   Pid:	69523
>   PPid:	1

...
> +++ b/lib/newlib_tests/test_children_cleanup.sh
> @@ -13,7 +13,13 @@ if [ "x$CHILD_PID" = "x" ]; then
>  elif ! kill -s 0 $CHILD_PID &>/dev/null; then
>  	echo "TPASS: Child process was cleaned up"
>  	exit 0
> +elif grep -q -E "Z|zombie" /proc/$CHILD_PID/status; then

task_state_array[] in fs/proc/array.c has "Z (zombie)"
Is there a reason to grep just "Z"? Because -E "Z|zombie" matches just "Z" and
when we don't grep for "State:" we can likely search for different result.
"Z (zombie)" is here form kernel git beginning (2.6.12-rc2), we should match it.

Kind regards,
Petr
Martin Doucha Feb. 14, 2022, 10:40 a.m. UTC | #2
On 13. 02. 22 5:28, Li Wang wrote:
> diff --git a/lib/newlib_tests/test_children_cleanup.sh b/lib/newlib_tests/test_children_cleanup.sh
> index 4b4e8b2f0..ec1a0d4fe 100755
> --- a/lib/newlib_tests/test_children_cleanup.sh
> +++ b/lib/newlib_tests/test_children_cleanup.sh
> @@ -13,7 +13,13 @@ if [ "x$CHILD_PID" = "x" ]; then
>  elif ! kill -s 0 $CHILD_PID &>/dev/null; then
>  	echo "TPASS: Child process was cleaned up"
>  	exit 0
> +elif grep -q -E "Z|zombie" /proc/$CHILD_PID/status; then
> +       echo "TPASS: Child process was in zombie state"
> +       exit 0

We're in a race condition here either way so reading the status procfile
after checking whether the process still exists can result in failure
even when the child was properly killed. I wrongly believed that
`kill -s 0` would fail when the target process is a zombie because the
manpage vaguely suggests that (see the description of ESRCH errno in the
kill(2) manpage) but it turns out I was wrong again.

I'll send a fix myself later today.
Li Wang Feb. 14, 2022, 11:18 a.m. UTC | #3
On Mon, Feb 14, 2022 at 6:40 PM Martin Doucha <mdoucha@suse.cz> wrote:

> On 13. 02. 22 5:28, Li Wang wrote:
> > diff --git a/lib/newlib_tests/test_children_cleanup.sh
> b/lib/newlib_tests/test_children_cleanup.sh
> > index 4b4e8b2f0..ec1a0d4fe 100755
> > --- a/lib/newlib_tests/test_children_cleanup.sh
> > +++ b/lib/newlib_tests/test_children_cleanup.sh
> > @@ -13,7 +13,13 @@ if [ "x$CHILD_PID" = "x" ]; then
> >  elif ! kill -s 0 $CHILD_PID &>/dev/null; then
> >       echo "TPASS: Child process was cleaned up"
> >       exit 0
> > +elif grep -q -E "Z|zombie" /proc/$CHILD_PID/status; then
> > +       echo "TPASS: Child process was in zombie state"
> > +       exit 0
>
> We're in a race condition here either way so reading the status procfile
> after checking whether the process still exists can result in failure
> even when the child was properly killed. I wrongly believed that
>

Ah yes. It is still possible to complete the zombie reclaim just before
doing the grep.



> `kill -s 0` would fail when the target process is a zombie because the
> manpage vaguely suggests that (see the description of ESRCH errno in the
> kill(2) manpage) but it turns out I was wrong again.
>
> I'll send a fix myself later today.
>

Thanks, I look forward to a better solution.
Li Wang Feb. 14, 2022, 11:21 a.m. UTC | #4
> > +elif grep -q -E "Z|zombie" /proc/$CHILD_PID/status; then
>
> task_state_array[] in fs/proc/array.c has "Z (zombie)"
> Is there a reason to grep just "Z"? Because -E "Z|zombie" matches just "Z"
> and
> when we don't grep for "State:" we can likely search for different result.
> "Z (zombie)" is here form kernel git beginning (2.6.12-rc2), we should
> match it.
>

Thanks for pointing out this, but it seems this fix-way will be superseded.
diff mbox series

Patch

diff --git a/lib/newlib_tests/test_children_cleanup.sh b/lib/newlib_tests/test_children_cleanup.sh
index 4b4e8b2f0..ec1a0d4fe 100755
--- a/lib/newlib_tests/test_children_cleanup.sh
+++ b/lib/newlib_tests/test_children_cleanup.sh
@@ -13,7 +13,13 @@  if [ "x$CHILD_PID" = "x" ]; then
 elif ! kill -s 0 $CHILD_PID &>/dev/null; then
 	echo "TPASS: Child process was cleaned up"
 	exit 0
+elif grep -q -E "Z|zombie" /proc/$CHILD_PID/status; then
+       echo "TPASS: Child process was in zombie state"
+       exit 0
 else
 	echo "TFAIL: Child process was left behind"
+	echo "cat /proc/$CHILD_PID/status"
+	echo "---------------------------"
+	cat /proc/$CHILD_PID/status
 	exit 1
 fi