diff mbox series

[RFCŊ[DO_NOT_MERGE,v2,1/1] netstress: Fix race between SETSID() and exit(0)

Message ID 20220218164845.11501-1-pvorel@suse.cz
State Changes Requested
Headers show
Series [RFCŊ[DO_NOT_MERGE,v2,1/1] netstress: Fix race between SETSID() and exit(0) | expand

Commit Message

Petr Vorel Feb. 18, 2022, 4:48 p.m. UTC
******* DO NOT MERGE *******

There is a race between the SETSID() and exit(0) in move_to_background()
caused by "Killed the leftover descendant processes" introduced in
72b172867 ("Terminate leftover subprocesses when main test process
crashes").

If the main test process calls exit(0) before the newly forked child
managed to do SETSID() it's killed by the test library because it's
still in the old process group. Therefore use TST_CHECKPOINT_{WAIT,WAKE}()
to avoid the race.

Link: https://lore.kernel.org/ltp/Yg+RXbUTOxK56iZa@pevik/

Suggested-by: Cyril Hrubis <chrubis@suse.cz>
Signed-off-by: Petr Vorel <pvorel@suse.cz>
---
This patch somehow cause server to think that "client asks to terminate"
server_fn():

void *server_fn(void *cfd)
{
	...
	/* client asks to terminate */
	if (recv_msg[0] == start_fin_byte) {
		tst_res(TINFO, "client asks to terminate");
		goto out;
	}

This is really strange, because because server shouldn't do anything
before exit(), I miss something here.

Kind regards,
Petr

tcp_ipsec 1 TINFO: timeout per run is 0h 5m 0s
tcp_ipsec 1 TINFO: run server 'netstress -D ltp_ns_veth1 -R 10 -B /tmp/LTP_tcp_ipsec.9uqaI9HX3i'
tcp_ipsec 1 TINFO: run client 'netstress -l -H 10.0.0.1 -n 100 -N 100 -D ltp_ns_veth2 -a 2 -r 100 -d tst_netload.res' 5 times

=> PROBLEM HERE
netstress.c:588: TINFO: client asks to terminate
netstress.c:644: TBROK: Server closed

tst_test.c:1457: TINFO: Timeout per run is 0h 05m 00s
netstress.c:901: TINFO: connection: addr '10.0.0.1', port '34989'
netstress.c:902: TINFO: client max req: 100
netstress.c:903: TINFO: clients num: 2
netstress.c:908: TINFO: client msg size: 100
netstress.c:909: TINFO: server msg size: 100
netstress.c:823: TINFO: tcp_tw_reuse is already set
netstress.c:953: TINFO: TCP client is using old TCP API.
netstress.c:795: TINFO: '/proc/sys/net/ipv4/tcp_fastopen' is 1
netstress.c:476: TINFO: Running the test over IPv4
netstress.c:503: TINFO: total time '4' ms
netstress.c:521: TPASS: test completed

Summary:
passed   1
failed   0
broken   0
skipped  0
warnings 0
tcp_ipsec 1 TFAIL: can't read tst_netload.res
tcp_ipsec 1 TINFO: AppArmor enabled, this may affect test results
tcp_ipsec 1 TINFO: it can be disabled with TST_DISABLE_APPARMOR=1 (requires super/root)
tcp_ipsec 1 TINFO: loaded AppArmor profiles: none

 testcases/network/netstress/netstress.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

Comments

Petr Vorel Feb. 18, 2022, 5:28 p.m. UTC | #1
Hi all,

using TST_CHECKPOINT_{WAIT,WAKE}() somehow causes server to run server_run()
(which shouldn't be run), I have no idea why. Maybe using pthread with it
triggers some library bug which was there but hidden?

Kind regards,
Petr
Cyril Hrubis Feb. 21, 2022, 1:05 p.m. UTC | #2
Hi!
Uff, I did found the root cause after debugging for a while.

The network tests rely a lot on passing data between processes by files
by a local directory and .needs_checkpoints causes the test to run under
a different directory, that is because checkpoints needs a backing file
that is mapped into the process memory. And so after setting
.need_checkpoints the client was storing the file into a different
directory.

This should be a minimal fix:

diff --git a/testcases/lib/tst_net.sh b/testcases/lib/tst_net.sh
index 047686dc3..891472c8a 100644
--- a/testcases/lib/tst_net.sh
+++ b/testcases/lib/tst_net.sh
@@ -715,7 +715,7 @@ tst_netload()
        fi

        s_opts="${cs_opts}${s_opts}-R $s_replies -B $TST_TMPDIR"
-       c_opts="${cs_opts}${c_opts}-a $c_num -r $((c_requests / run_cnt)) -d $rfile"
+       c_opts="${cs_opts}${c_opts}-a $c_num -r $((c_requests / run_cnt)) -d $PWD/$rfile"

        tst_res_ TINFO "run server 'netstress $s_opts'"
        tst_res_ TINFO "run client 'netstress -l $c_opts' $run_cnt times"



However the debugging took longer than I wanted to since the network
tests are such a mess. The server does exit by TBROK (which looks like
it's an expected behavior), only half of the sever log is printed on a
failure, etc. These should really deserve some cleanups...
Petr Vorel Feb. 21, 2022, 1:13 p.m. UTC | #3
Hi Cyril,

> Hi!
> Uff, I did found the root cause after debugging for a while.
Thanks a lot!

> The network tests rely a lot on passing data between processes by files
> by a local directory and .needs_checkpoints causes the test to run under
> a different directory, that is because checkpoints needs a backing file
> that is mapped into the process memory. And so after setting
> .need_checkpoints the client was storing the file into a different
> directory.

> This should be a minimal fix:

> diff --git a/testcases/lib/tst_net.sh b/testcases/lib/tst_net.sh
> index 047686dc3..891472c8a 100644
> --- a/testcases/lib/tst_net.sh
> +++ b/testcases/lib/tst_net.sh
> @@ -715,7 +715,7 @@ tst_netload()
>         fi

>         s_opts="${cs_opts}${s_opts}-R $s_replies -B $TST_TMPDIR"
> -       c_opts="${cs_opts}${c_opts}-a $c_num -r $((c_requests / run_cnt)) -d $rfile"
> +       c_opts="${cs_opts}${c_opts}-a $c_num -r $((c_requests / run_cnt)) -d $PWD/$rfile"

>         tst_res_ TINFO "run server 'netstress $s_opts'"
>         tst_res_ TINFO "run client 'netstress -l $c_opts' $run_cnt times"


Yes, this looks like enough. Do you want me to merge this proposal with added
this change? Or you send a patch or just merge fix yourself?

> However the debugging took longer than I wanted to since the network
> tests are such a mess. The server does exit by TBROK (which looks like
> it's an expected behavior), only half of the sever log is printed on a
> failure, etc. These should really deserve some cleanups...
I'd say specifically tst_netload() (in tst_net.sh) and netstress.c deserve
cleanup. Also, as we noticed several times shell tests tends to be buggy,
specially in combination with C tests. But not sure if feasible to write
everything in C.

Kind regards,
Petr
Cyril Hrubis Feb. 21, 2022, 1:26 p.m. UTC | #4
Hi!
> >         tst_res_ TINFO "run server 'netstress $s_opts'"
> >         tst_res_ TINFO "run client 'netstress -l $c_opts' $run_cnt times"
> 
> 
> Yes, this looks like enough. Do you want me to merge this proposal with added
> this change? Or you send a patch or just merge fix yourself?

Just merge it all into a single patch.

> > However the debugging took longer than I wanted to since the network
> > tests are such a mess. The server does exit by TBROK (which looks like
> > it's an expected behavior), only half of the sever log is printed on a
> > failure, etc. These should really deserve some cleanups...
> I'd say specifically tst_netload() (in tst_net.sh) and netstress.c deserve
> cleanup. Also, as we noticed several times shell tests tends to be buggy,
> specially in combination with C tests. But not sure if feasible to write
> everything in C.

I guess that we can do some minor fixes, but the whole codebase looks
like it needs to be rethinked and redesigned...
Petr Vorel Feb. 21, 2022, 1:36 p.m. UTC | #5
Hi Cyril,

merged, thanks!

> > Yes, this looks like enough. Do you want me to merge this proposal with added
> > this change? Or you send a patch or just merge fix yourself?

> Just merge it all into a single patch.
Sure this was to go to single commit, just wasn't sure if you want to merge it.

> > > However the debugging took longer than I wanted to since the network
> > > tests are such a mess. The server does exit by TBROK (which looks like
> > > it's an expected behavior), only half of the sever log is printed on a
> > > failure, etc. These should really deserve some cleanups...
> > I'd say specifically tst_netload() (in tst_net.sh) and netstress.c deserve
> > cleanup. Also, as we noticed several times shell tests tends to be buggy,
> > specially in combination with C tests. But not sure if feasible to write
> > everything in C.

> I guess that we can do some minor fixes, but the whole codebase looks
> like it needs to be rethinked and redesigned...
+1

Kind regards,
Petr
diff mbox series

Patch

diff --git a/testcases/network/netstress/netstress.c b/testcases/network/netstress/netstress.c
index 0914c65bd4..6c9e83112e 100644
--- a/testcases/network/netstress/netstress.c
+++ b/testcases/network/netstress/netstress.c
@@ -713,11 +713,15 @@  static void server_cleanup(void)
 
 static void move_to_background(void)
 {
-	if (SAFE_FORK())
+	if (SAFE_FORK()) {
+		TST_CHECKPOINT_WAIT(0);
 		exit(0);
+	}
 
 	SAFE_SETSID();
 
+	TST_CHECKPOINT_WAKE(0);
+
 	close(STDIN_FILENO);
 	SAFE_OPEN("/dev/null", O_RDONLY);
 	close(STDOUT_FILENO);
@@ -1024,4 +1028,5 @@  static struct tst_test test = {
 		{"B:", &server_bg, "Run in background, arg is the process directory"},
 		{}
 	},
+	.needs_checkpoints = 1,
 };