Message ID | 20220218164845.11501-1-pvorel@suse.cz |
---|---|
State | Changes Requested |
Headers | show |
Series | [RFCŊ[DO_NOT_MERGE,v2,1/1] netstress: Fix race between SETSID() and exit(0) | expand |
Hi all, using TST_CHECKPOINT_{WAIT,WAKE}() somehow causes server to run server_run() (which shouldn't be run), I have no idea why. Maybe using pthread with it triggers some library bug which was there but hidden? Kind regards, Petr
Hi! Uff, I did found the root cause after debugging for a while. The network tests rely a lot on passing data between processes by files by a local directory and .needs_checkpoints causes the test to run under a different directory, that is because checkpoints needs a backing file that is mapped into the process memory. And so after setting .need_checkpoints the client was storing the file into a different directory. This should be a minimal fix: diff --git a/testcases/lib/tst_net.sh b/testcases/lib/tst_net.sh index 047686dc3..891472c8a 100644 --- a/testcases/lib/tst_net.sh +++ b/testcases/lib/tst_net.sh @@ -715,7 +715,7 @@ tst_netload() fi s_opts="${cs_opts}${s_opts}-R $s_replies -B $TST_TMPDIR" - c_opts="${cs_opts}${c_opts}-a $c_num -r $((c_requests / run_cnt)) -d $rfile" + c_opts="${cs_opts}${c_opts}-a $c_num -r $((c_requests / run_cnt)) -d $PWD/$rfile" tst_res_ TINFO "run server 'netstress $s_opts'" tst_res_ TINFO "run client 'netstress -l $c_opts' $run_cnt times" However the debugging took longer than I wanted to since the network tests are such a mess. The server does exit by TBROK (which looks like it's an expected behavior), only half of the sever log is printed on a failure, etc. These should really deserve some cleanups...
Hi Cyril, > Hi! > Uff, I did found the root cause after debugging for a while. Thanks a lot! > The network tests rely a lot on passing data between processes by files > by a local directory and .needs_checkpoints causes the test to run under > a different directory, that is because checkpoints needs a backing file > that is mapped into the process memory. And so after setting > .need_checkpoints the client was storing the file into a different > directory. > This should be a minimal fix: > diff --git a/testcases/lib/tst_net.sh b/testcases/lib/tst_net.sh > index 047686dc3..891472c8a 100644 > --- a/testcases/lib/tst_net.sh > +++ b/testcases/lib/tst_net.sh > @@ -715,7 +715,7 @@ tst_netload() > fi > s_opts="${cs_opts}${s_opts}-R $s_replies -B $TST_TMPDIR" > - c_opts="${cs_opts}${c_opts}-a $c_num -r $((c_requests / run_cnt)) -d $rfile" > + c_opts="${cs_opts}${c_opts}-a $c_num -r $((c_requests / run_cnt)) -d $PWD/$rfile" > tst_res_ TINFO "run server 'netstress $s_opts'" > tst_res_ TINFO "run client 'netstress -l $c_opts' $run_cnt times" Yes, this looks like enough. Do you want me to merge this proposal with added this change? Or you send a patch or just merge fix yourself? > However the debugging took longer than I wanted to since the network > tests are such a mess. The server does exit by TBROK (which looks like > it's an expected behavior), only half of the sever log is printed on a > failure, etc. These should really deserve some cleanups... I'd say specifically tst_netload() (in tst_net.sh) and netstress.c deserve cleanup. Also, as we noticed several times shell tests tends to be buggy, specially in combination with C tests. But not sure if feasible to write everything in C. Kind regards, Petr
Hi! > > tst_res_ TINFO "run server 'netstress $s_opts'" > > tst_res_ TINFO "run client 'netstress -l $c_opts' $run_cnt times" > > > Yes, this looks like enough. Do you want me to merge this proposal with added > this change? Or you send a patch or just merge fix yourself? Just merge it all into a single patch. > > However the debugging took longer than I wanted to since the network > > tests are such a mess. The server does exit by TBROK (which looks like > > it's an expected behavior), only half of the sever log is printed on a > > failure, etc. These should really deserve some cleanups... > I'd say specifically tst_netload() (in tst_net.sh) and netstress.c deserve > cleanup. Also, as we noticed several times shell tests tends to be buggy, > specially in combination with C tests. But not sure if feasible to write > everything in C. I guess that we can do some minor fixes, but the whole codebase looks like it needs to be rethinked and redesigned...
Hi Cyril, merged, thanks! > > Yes, this looks like enough. Do you want me to merge this proposal with added > > this change? Or you send a patch or just merge fix yourself? > Just merge it all into a single patch. Sure this was to go to single commit, just wasn't sure if you want to merge it. > > > However the debugging took longer than I wanted to since the network > > > tests are such a mess. The server does exit by TBROK (which looks like > > > it's an expected behavior), only half of the sever log is printed on a > > > failure, etc. These should really deserve some cleanups... > > I'd say specifically tst_netload() (in tst_net.sh) and netstress.c deserve > > cleanup. Also, as we noticed several times shell tests tends to be buggy, > > specially in combination with C tests. But not sure if feasible to write > > everything in C. > I guess that we can do some minor fixes, but the whole codebase looks > like it needs to be rethinked and redesigned... +1 Kind regards, Petr
diff --git a/testcases/network/netstress/netstress.c b/testcases/network/netstress/netstress.c index 0914c65bd4..6c9e83112e 100644 --- a/testcases/network/netstress/netstress.c +++ b/testcases/network/netstress/netstress.c @@ -713,11 +713,15 @@ static void server_cleanup(void) static void move_to_background(void) { - if (SAFE_FORK()) + if (SAFE_FORK()) { + TST_CHECKPOINT_WAIT(0); exit(0); + } SAFE_SETSID(); + TST_CHECKPOINT_WAKE(0); + close(STDIN_FILENO); SAFE_OPEN("/dev/null", O_RDONLY); close(STDOUT_FILENO); @@ -1024,4 +1028,5 @@ static struct tst_test test = { {"B:", &server_bg, "Run in background, arg is the process directory"}, {} }, + .needs_checkpoints = 1, };
******* DO NOT MERGE ******* There is a race between the SETSID() and exit(0) in move_to_background() caused by "Killed the leftover descendant processes" introduced in 72b172867 ("Terminate leftover subprocesses when main test process crashes"). If the main test process calls exit(0) before the newly forked child managed to do SETSID() it's killed by the test library because it's still in the old process group. Therefore use TST_CHECKPOINT_{WAIT,WAKE}() to avoid the race. Link: https://lore.kernel.org/ltp/Yg+RXbUTOxK56iZa@pevik/ Suggested-by: Cyril Hrubis <chrubis@suse.cz> Signed-off-by: Petr Vorel <pvorel@suse.cz> --- This patch somehow cause server to think that "client asks to terminate" server_fn(): void *server_fn(void *cfd) { ... /* client asks to terminate */ if (recv_msg[0] == start_fin_byte) { tst_res(TINFO, "client asks to terminate"); goto out; } This is really strange, because because server shouldn't do anything before exit(), I miss something here. Kind regards, Petr tcp_ipsec 1 TINFO: timeout per run is 0h 5m 0s tcp_ipsec 1 TINFO: run server 'netstress -D ltp_ns_veth1 -R 10 -B /tmp/LTP_tcp_ipsec.9uqaI9HX3i' tcp_ipsec 1 TINFO: run client 'netstress -l -H 10.0.0.1 -n 100 -N 100 -D ltp_ns_veth2 -a 2 -r 100 -d tst_netload.res' 5 times => PROBLEM HERE netstress.c:588: TINFO: client asks to terminate netstress.c:644: TBROK: Server closed tst_test.c:1457: TINFO: Timeout per run is 0h 05m 00s netstress.c:901: TINFO: connection: addr '10.0.0.1', port '34989' netstress.c:902: TINFO: client max req: 100 netstress.c:903: TINFO: clients num: 2 netstress.c:908: TINFO: client msg size: 100 netstress.c:909: TINFO: server msg size: 100 netstress.c:823: TINFO: tcp_tw_reuse is already set netstress.c:953: TINFO: TCP client is using old TCP API. netstress.c:795: TINFO: '/proc/sys/net/ipv4/tcp_fastopen' is 1 netstress.c:476: TINFO: Running the test over IPv4 netstress.c:503: TINFO: total time '4' ms netstress.c:521: TPASS: test completed Summary: passed 1 failed 0 broken 0 skipped 0 warnings 0 tcp_ipsec 1 TFAIL: can't read tst_netload.res tcp_ipsec 1 TINFO: AppArmor enabled, this may affect test results tcp_ipsec 1 TINFO: it can be disabled with TST_DISABLE_APPARMOR=1 (requires super/root) tcp_ipsec 1 TINFO: loaded AppArmor profiles: none testcases/network/netstress/netstress.c | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-)