[v3,2/3] lib/tst_test.c: Update result counters when calling tst_brk()

Message ID 1544690160-13900-2-git-send-email-yangx.jy@cn.fujitsu.com
State New
Headers show
Series
  • [v3,1/3] lib: Introduce tst_strttype()
Related show

Commit Message

Xiao Yang Dec. 13, 2018, 8:35 a.m.
1) Catch and report the TFAIL exit status of child process.
2) Only update result counters in library process and main test
   process because the exit status of child can be reported by
   main test process.
3) Print TCONF message and increase skipped when calling tst_brk(TCONF).
   Print TBROK message and increase broken when calling tst_brk(TBROK).
   Print TFAIL message and increase failed when calling tst_brk(TFAIL).
4) Remove duplicate update_results() in run_tcases_per_fs().

Fixes: #408

Signed-off-by: Xiao Yang <yangx.jy@cn.fujitsu.com>
---
 lib/tst_test.c | 24 ++++++++++++++++--------
 1 file changed, 16 insertions(+), 8 deletions(-)

Comments

Cyril Hrubis Jan. 7, 2019, 3:06 p.m. | #1
Hi!
> 1) Catch and report the TFAIL exit status of child process.

Looking at the codebase we do have a few usages of tst_brk(TFAIL, "...")
to exit the child process, which sort of works but it's incorrect. The
tst_brk() always meant "unrecoverable failure have happened, exit the
current process as fast as possible". Looking over our codebase most of
the tst_brk(TFAIL, "...") should not actually cause the main test
process to exit, these were only meant to exit the child and report the
result in one call. It will for instance break the test with -i option
on the first failure, which is incorrect.

So if we ever want to have a function to exit child process with a result we
should implement tst_ret() that would be equivalent to tst_res() followed by
exit(0).

It could be even implemented as:

#define tst_ret(ttype, fmt, ...) \
	do { \
		tst_res_(__FILE__, __LINE__, (ttype), (fmt), ##__VA_ARGS__); \
		exit(0); \
	} while (0)

This function has one big advantage, it increments the results counters
before the child process exits.

Actually one of the big points of the new test library was that the
results counters are atomically increased, because passing the results
in exit values is nightmare that cannot be done correclty.

> 2) Only update result counters in library process and main test
>    process because the exit status of child can be reported by
>    main test process.

Actually after I spend some time on it I think that the best solution is
to update the results in the piece of shared memory as fast as possible,
anything else is prone to various races and corner cases.

> 3) Print TCONF message and increase skipped when calling tst_brk(TCONF).
>    Print TBROK message and increase broken when calling tst_brk(TBROK).
>    Print TFAIL message and increase failed when calling tst_brk(TFAIL).
> 4) Remove duplicate update_results() in run_tcases_per_fs().

I've been thinking about this and the problem is more complex, and I'm
even not sure that it's possible to write the library so that the
counters are consistent at the time we exit the test if something
unexpected happened and we called tst_brk().

Consider for instance this example:

#include "tst_test.h"

static void do_test(void)
{
        if (!SAFE_FORK())
                tst_brk(TBROK, "child");
        tst_brk(TBROK, "parent");
}

static struct tst_test test = {
        .test_all = do_test,
        .forks_child = 1,
};

When tst_brk() is called both in parent and child the counter would be
incremented only once because the child is not waited for by the main
test.

We can close this special case by changing the main test pid to wait for the
children before it calls exit() in the tst_brk() but that may cause the
main process to get stuck undefinitely if the child processes get stuck,
so we would have to be careful.

Also from the very definition of the TBROK return status the test
results would be incomplete at best, since TBROK really means
"unrecoverable error happened during the test" which would mostly means
that something as low level as filesystem got corrupted and there is no
point in presenting the results in that case, so I guess that the best
we could do in the case of TBROK is to print big message that says
"things went horribly wrong!" or something similar.

All in all I would like to avoid applying patches to the test library
before we finalize the release, since there is not much time for
testing now.
Jan Stancek Jan. 7, 2019, 5:39 p.m. | #2
----- Original Message -----
> Hi!
> > 1) Catch and report the TFAIL exit status of child process.
> 
> Looking at the codebase we do have a few usages of tst_brk(TFAIL, "...")
> to exit the child process, which sort of works but it's incorrect. The
> tst_brk() always meant "unrecoverable failure have happened, exit the
> current process as fast as possible". Looking over our codebase most of
> the tst_brk(TFAIL, "...") should not actually cause the main test
> process to exit, these were only meant to exit the child and report the
> result in one call. It will for instance break the test with -i option
> on the first failure, which is incorrect.

Nice example, would you care to add that to docs?

> 
> So if we ever want to have a function to exit child process with a result we
> should implement tst_ret() that would be equivalent to tst_res() followed by
> exit(0).
> 
> It could be even implemented as:
> 
> #define tst_ret(ttype, fmt, ...) \
> 	do { \
> 		tst_res_(__FILE__, __LINE__, (ttype), (fmt), ##__VA_ARGS__); \
> 		exit(0); \
> 	} while (0)
> 
> This function has one big advantage, it increments the results counters
> before the child process exits.

If all call-sites switch to tst_ret(), we could add TFAIL to tst_brk
compile time check.

> 
> Actually one of the big points of the new test library was that the
> results counters are atomically increased, because passing the results
> in exit values is nightmare that cannot be done correclty.
> 
> > 2) Only update result counters in library process and main test
> >    process because the exit status of child can be reported by
> >    main test process.
> 
> Actually after I spend some time on it I think that the best solution is
> to update the results in the piece of shared memory as fast as possible,
> anything else is prone to various races and corner cases.

I was thinking this too.

If your parent process happens to wait for the child itself,
then library will never get to see retcode.

Regards,
Jan

> 
> > 3) Print TCONF message and increase skipped when calling tst_brk(TCONF).
> >    Print TBROK message and increase broken when calling tst_brk(TBROK).
> >    Print TFAIL message and increase failed when calling tst_brk(TFAIL).
> > 4) Remove duplicate update_results() in run_tcases_per_fs().
> 
> I've been thinking about this and the problem is more complex, and I'm
> even not sure that it's possible to write the library so that the
> counters are consistent at the time we exit the test if something
> unexpected happened and we called tst_brk().
> 
> Consider for instance this example:
> 
> #include "tst_test.h"
> 
> static void do_test(void)
> {
>         if (!SAFE_FORK())
>                 tst_brk(TBROK, "child");
>         tst_brk(TBROK, "parent");
> }
> 
> static struct tst_test test = {
>         .test_all = do_test,
>         .forks_child = 1,
> };
> 
> When tst_brk() is called both in parent and child the counter would be
> incremented only once because the child is not waited for by the main
> test.
> 
> We can close this special case by changing the main test pid to wait for the
> children before it calls exit() in the tst_brk() but that may cause the
> main process to get stuck undefinitely if the child processes get stuck,
> so we would have to be careful.
> 
> Also from the very definition of the TBROK return status the test
> results would be incomplete at best, since TBROK really means
> "unrecoverable error happened during the test" which would mostly means
> that something as low level as filesystem got corrupted and there is no
> point in presenting the results in that case, so I guess that the best
> we could do in the case of TBROK is to print big message that says
> "things went horribly wrong!" or something similar.
> 
> All in all I would like to avoid applying patches to the test library
> before we finalize the release, since there is not much time for
> testing now.
> 
> --
> Cyril Hrubis
> chrubis@suse.cz
>
Cyril Hrubis Jan. 7, 2019, 6:29 p.m. | #3
Hi!
> > Looking at the codebase we do have a few usages of tst_brk(TFAIL, "...")
> > to exit the child process, which sort of works but it's incorrect. The
> > tst_brk() always meant "unrecoverable failure have happened, exit the
> > current process as fast as possible". Looking over our codebase most of
> > the tst_brk(TFAIL, "...") should not actually cause the main test
> > process to exit, these were only meant to exit the child and report the
> > result in one call. It will for instance break the test with -i option
> > on the first failure, which is incorrect.
> 
> Nice example, would you care to add that to docs?

Good idea.

> > So if we ever want to have a function to exit child process with a result we
> > should implement tst_ret() that would be equivalent to tst_res() followed by
> > exit(0).
> > 
> > It could be even implemented as:
> > 
> > #define tst_ret(ttype, fmt, ...) \
> > 	do { \
> > 		tst_res_(__FILE__, __LINE__, (ttype), (fmt), ##__VA_ARGS__); \
> > 		exit(0); \
> > 	} while (0)
> > 
> > This function has one big advantage, it increments the results counters
> > before the child process exits.
> 
> If all call-sites switch to tst_ret(), we could add TFAIL to tst_brk
> compile time check.

Actually I've started to work on that and all call sites can either be
converted to tst_ret(TFAIL, ...) or tst_brk(TBROK, ...). So it really
looks like we should go this way.

Also I guess that this may be even safe enough to go in before the
release. I will try to finish and post it tomorrow.
Xiao Yang Jan. 8, 2019, 9:08 a.m. | #4
On 2019/01/07 23:06, Cyril Hrubis wrote:
> Hi!
>> 1) Catch and report the TFAIL exit status of child process.
> Looking at the codebase we do have a few usages of tst_brk(TFAIL, "...")
> to exit the child process, which sort of works but it's incorrect. The
> tst_brk() always meant "unrecoverable failure have happened, exit the
> current process as fast as possible". Looking over our codebase most of
> the tst_brk(TFAIL, "...") should not actually cause the main test
> process to exit, these were only meant to exit the child and report the
> result in one call. It will for instance break the test with -i option
> on the first failure, which is incorrect.
Hi Cyril,

Detailed explanation, and i got it.

> So if we ever want to have a function to exit child process with a result we
> should implement tst_ret() that would be equivalent to tst_res() followed by
> exit(0).
>
> It could be even implemented as:
>
> #define tst_ret(ttype, fmt, ...) \
> 	do { \
> 		tst_res_(__FILE__, __LINE__, (ttype), (fmt), ##__VA_ARGS__); \
> 		exit(0); \
> 	} while (0)
>
> This function has one big advantage, it increments the results counters
> before the child process exits.
>
> Actually one of the big points of the new test library was that the
> results counters are atomically increased, because passing the results
> in exit values is nightmare that cannot be done correclty.
Agreed.  All of tst_brk(TFAIL, ...) can be converted to tst_ret(TFAIL, ...) or
tst_brk(TBROK, ...) in this way and then add TFAIL to tst_brk compile time check
as Jan replied, so that only TCONF and TBROK can be passed into tst_brk().

>> 2) Only update result counters in library process and main test
>>     process because the exit status of child can be reported by
>>     main test process.
> Actually after I spend some time on it I think that the best solution is
> to update the results in the piece of shared memory as fast as possible,
> anything else is prone to various races and corner cases.
...

>> 3) Print TCONF message and increase skipped when calling tst_brk(TCONF).
>>     Print TBROK message and increase broken when calling tst_brk(TBROK).
>>     Print TFAIL message and increase failed when calling tst_brk(TFAIL).
>> 4) Remove duplicate update_results() in run_tcases_per_fs().
> I've been thinking about this and the problem is more complex, and I'm
> even not sure that it's possible to write the library so that the
> counters are consistent at the time we exit the test if something
> unexpected happened and we called tst_brk().
>
> Consider for instance this example:
>
> #include "tst_test.h"
>
> static void do_test(void)
> {
>          if (!SAFE_FORK())
>                  tst_brk(TBROK, "child");
>          tst_brk(TBROK, "parent");
> }
>
> static struct tst_test test = {
>          .test_all = do_test,
>          .forks_child = 1,
> };
>
> When tst_brk() is called both in parent and child the counter would be
> incremented only once because the child is not waited for by the main
> test.
>
> We can close this special case by changing the main test pid to wait for the
> children before it calls exit() in the tst_brk() but that may cause the
> main process to get stuck undefinitely if the child processes get stuck,
> so we would have to be careful.
>
> Also from the very definition of the TBROK return status the test
> results would be incomplete at best, since TBROK really means
> "unrecoverable error happened during the test" which would mostly means
> that something as low level as filesystem got corrupted and there is no
> point in presenting the results in that case, so I guess that the best
> we could do in the case of TBROK is to print big message that says
> "things went horribly wrong!" or something similar.
Sorry, my patch is too rough becasue some suitations are not taken into account.
For tst_brk(TCONF), do you mean to replace the current solution using wait() in
check_child_status() with your suggested shared memory?
For tst_brk(TBROK), do you mean to just print big message instead of updating
test results?

> All in all I would like to avoid applying patches to the test library
> before we finalize the release, since there is not much time for
> testing now.
Agreed, drop these patches during the upcoming release.  We still need to do
future investigation and testing.

Best Regards,
Xiao Yang
Cyril Hrubis Jan. 8, 2019, 1:11 p.m. | #5
Hi!
> > So if we ever want to have a function to exit child process with a result we
> > should implement tst_ret() that would be equivalent to tst_res() followed by
> > exit(0).
> > 
> > It could be even implemented as:
> > 
> > #define tst_ret(ttype, fmt, ...) \
> > 	do { \
> > 		tst_res_(__FILE__, __LINE__, (ttype), (fmt), ##__VA_ARGS__); \
> > 		exit(0); \
> > 	} while (0)
> > 
> > This function has one big advantage, it increments the results counters
> > before the child process exits.
> 
> If all call-sites switch to tst_ret(), we could add TFAIL to tst_brk
> compile time check.

And it's even more complicated than this.

The problem is that if we add tst_ret() that calls exit(0) it will have
slightly different semantics for the main test process and for the
children.

The subtle difference is that we run the loop that executes the tests in
the case of the -i parameter or in the case that we defined tcnt is in
the main test process. If we call tst_ret() there it will exit the test
before we manage to run rest of the iterations and I would like to avoid
adding an API that is slightly confusing easy to misuse.

I will try to rethink this, however it looks like there is no silver
bullet to solve the problem. And it's certainly late for the release.

Also I've opened an issue for the problem in order not to forget:

https://github.com/linux-test-project/ltp/issues/462

Patch

diff --git a/lib/tst_test.c b/lib/tst_test.c
index 661fbbf..e46ab8e 100644
--- a/lib/tst_test.c
+++ b/lib/tst_test.c
@@ -55,6 +55,7 @@  struct results {
 	int skipped;
 	int failed;
 	int warnings;
+	int broken;
 	unsigned int timeout;
 };
 
@@ -177,6 +178,9 @@  static void update_results(int ttype)
 	case TFAIL:
 		tst_atomic_inc(&results->failed);
 	break;
+	case TBROK:
+		tst_atomic_inc(&results->broken);
+	break;
 	}
 }
 
@@ -305,11 +309,15 @@  void tst_vbrk_(const char *file, const int lineno, int ttype,
 	 * specified but CLONE_THREAD is not. Use direct syscall to avoid
 	 * cleanup running in the child.
 	 */
-	if (syscall(SYS_getpid) == main_pid)
+	if (syscall(SYS_getpid) == main_pid) {
+		update_results(TTYPE_RESULT(ttype));
 		do_test_cleanup();
+	}
 
-	if (getpid() == lib_pid)
+	if (getpid() == lib_pid) {
+		update_results(TTYPE_RESULT(ttype));
 		do_exit(TTYPE_RESULT(ttype));
+	}
 
 	exit(TTYPE_RESULT(ttype));
 }
@@ -350,6 +358,7 @@  static void check_child_status(pid_t pid, int status)
 	switch (ret) {
 	case TPASS:
 	break;
+	case TFAIL:
 	case TBROK:
 	case TCONF:
 		tst_brk(ret, "Reported by child (%i)", pid);
@@ -591,6 +600,7 @@  static void do_exit(int ret)
 		printf("failed   %d\n", results->failed);
 		printf("skipped  %d\n", results->skipped);
 		printf("warnings %d\n", results->warnings);
+		printf("broken   %d\n", results->broken);
 
 		if (results->passed && ret == TCONF)
 			ret = 0;
@@ -603,6 +613,9 @@  static void do_exit(int ret)
 
 		if (results->warnings)
 			ret |= TWARN;
+
+		if (results->broken)
+			ret |= TBROK;
 	}
 
 	do_cleanup();
@@ -1155,12 +1168,7 @@  static int run_tcases_per_fs(void)
 			mntpoint_mounted = 0;
 		}
 
-		if (ret == TCONF) {
-			update_results(ret);
-			continue;
-		}
-
-		if (ret == 0)
+		if (ret == TCONF || ret == 0)
 			continue;
 
 		do_exit(ret);