diff mbox series

[v2,for-3.0] tests/libqtest: Improve kill_qemu()

Message ID 20180723193530.20891-1-eblake@redhat.com
State New
Headers show
Series [v2,for-3.0] tests/libqtest: Improve kill_qemu() | expand

Commit Message

Eric Blake July 23, 2018, 7:35 p.m. UTC
In kill_qemu() we have an assert that checks that the QEMU process
didn't dump core:
            assert(!WCOREDUMP(wstatus));

Unfortunately the WCOREDUMP macro here means the resulting message
is not very easy to comprehend on at least some systems:

ahci-test: tests/libqtest.c:113: kill_qemu: Assertion `!(((__extension__ (((union { __typeof(wstatus) __in; int __i; }) { .__in = (wstatus) }).__i))) & 0x80)' failed.

and it doesn't identify what signal the process took.

Furthermore, we are NOT detecting EINTR (while EINTR shouldn't be
happening if we didn't install signal handlers, it's still better
to always be robust), and also want to log unexpected non-zero status
that was not accompanied by a core dump.

Instead of using a raw assert, print the information in an
easier to understand way:

/i386/ahci/sanity: tests/libqtest.c:119: kill_qemu() detected QEMU death with core dump from signal 11 (Segmentation fault)
Aborted (core dumped)

(Of course, the really useful information would be why the QEMU
process dumped core in the first place, but we don't have that
by the time the test program has picked up the exit status.)

Suggested-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Eric Blake <eblake@redhat.com>
---

I've taken the ideas from Peter's patch:
https://lists.gnu.org/archive/html/qemu-devel/2018-07/msg04430.html
as well as fixing a related issue brought up last time this was touched:
https://lists.gnu.org/archive/html/qemu-devel/2018-05/msg05710.html

 tests/libqtest.c | 39 +++++++++++++++++++++++++++++++++++++--
 1 file changed, 37 insertions(+), 2 deletions(-)

Comments

Richard Henderson July 23, 2018, 8:20 p.m. UTC | #1
On 07/23/2018 12:35 PM, Eric Blake wrote:
> In kill_qemu() we have an assert that checks that the QEMU process
> didn't dump core:
>             assert(!WCOREDUMP(wstatus));
> 
> Unfortunately the WCOREDUMP macro here means the resulting message
> is not very easy to comprehend on at least some systems:
> 
> ahci-test: tests/libqtest.c:113: kill_qemu: Assertion `!(((__extension__ (((union { __typeof(wstatus) __in; int __i; }) { .__in = (wstatus) }).__i))) & 0x80)' failed.
> 
> and it doesn't identify what signal the process took.
> 
> Furthermore, we are NOT detecting EINTR (while EINTR shouldn't be
> happening if we didn't install signal handlers, it's still better
> to always be robust), and also want to log unexpected non-zero status
> that was not accompanied by a core dump.
> 
> Instead of using a raw assert, print the information in an
> easier to understand way:
> 
> /i386/ahci/sanity: tests/libqtest.c:119: kill_qemu() detected QEMU death with core dump from signal 11 (Segmentation fault)
> Aborted (core dumped)
> 
> (Of course, the really useful information would be why the QEMU
> process dumped core in the first place, but we don't have that
> by the time the test program has picked up the exit status.)
> 
> Suggested-by: Peter Maydell <peter.maydell@linaro.org>
> Signed-off-by: Eric Blake <eblake@redhat.com>
> ---
> 
> I've taken the ideas from Peter's patch:
> https://lists.gnu.org/archive/html/qemu-devel/2018-07/msg04430.html
> as well as fixing a related issue brought up last time this was touched:
> https://lists.gnu.org/archive/html/qemu-devel/2018-05/msg05710.html
> 
>  tests/libqtest.c | 39 +++++++++++++++++++++++++++++++++++++--
>  1 file changed, 37 insertions(+), 2 deletions(-)

Reviewed-by: Richard Henderson <richard.henderson@linaro.org>


r~
Markus Armbruster July 24, 2018, 6:36 a.m. UTC | #2
Eric Blake <eblake@redhat.com> writes:

> In kill_qemu() we have an assert that checks that the QEMU process
> didn't dump core:
>             assert(!WCOREDUMP(wstatus));
>
> Unfortunately the WCOREDUMP macro here means the resulting message
> is not very easy to comprehend on at least some systems:
>
> ahci-test: tests/libqtest.c:113: kill_qemu: Assertion `!(((__extension__ (((union { __typeof(wstatus) __in; int __i; }) { .__in = (wstatus) }).__i))) & 0x80)' failed.
>
> and it doesn't identify what signal the process took.
>
> Furthermore, we are NOT detecting EINTR (while EINTR shouldn't be
> happening if we didn't install signal handlers, it's still better
> to always be robust), and also want to log unexpected non-zero status
> that was not accompanied by a core dump.
>
> Instead of using a raw assert, print the information in an
> easier to understand way:
>
> /i386/ahci/sanity: tests/libqtest.c:119: kill_qemu() detected QEMU death with core dump from signal 11 (Segmentation fault)
> Aborted (core dumped)
>
> (Of course, the really useful information would be why the QEMU
> process dumped core in the first place, but we don't have that
> by the time the test program has picked up the exit status.)
>
> Suggested-by: Peter Maydell <peter.maydell@linaro.org>
> Signed-off-by: Eric Blake <eblake@redhat.com>
> ---
>
> I've taken the ideas from Peter's patch:
> https://lists.gnu.org/archive/html/qemu-devel/2018-07/msg04430.html
> as well as fixing a related issue brought up last time this was touched:
> https://lists.gnu.org/archive/html/qemu-devel/2018-05/msg05710.html
>
>  tests/libqtest.c | 39 +++++++++++++++++++++++++++++++++++++--
>  1 file changed, 37 insertions(+), 2 deletions(-)
>
> diff --git a/tests/libqtest.c b/tests/libqtest.c
> index 098af6aec44..f3dabfadd78 100644
> --- a/tests/libqtest.c
> +++ b/tests/libqtest.c
> @@ -105,12 +105,47 @@ static void kill_qemu(QTestState *s)
>      if (s->qemu_pid != -1) {
>          int wstatus = 0;
>          pid_t pid;
> +        bool die = false;
>
>          kill(s->qemu_pid, SIGTERM);
> +    retry:
>          pid = waitpid(s->qemu_pid, &wstatus, 0);
> +        if (pid == -1 && errno == EINTR) {
> +            goto retry;
> +        }
>
> -        if (pid == s->qemu_pid && WIFSIGNALED(wstatus)) {
> -            assert(!WCOREDUMP(wstatus));
> +        assert(pid == s->qemu_pid);
> +        /*
> +         * We expect qemu to exit with status 0; anything else is
> +         * fishy and should be logged.  Abort except when death by
> +         * signal is not accompanied by a coredump (as that's the only
> +         * time it was likely that the user is trying to kill the
> +         * testsuite early).
> +         */
> +        if (wstatus) {
> +            die = true;
> +            if (WIFEXITED(wstatus)) {
> +                fprintf(stderr, "%s:%d: kill_qemu() tried to terminate QEMU "
> +                        "process but encountered exit status %d\n",
> +                        __FILE__, __LINE__, WEXITSTATUS(wstatus));
> +            } else if (WIFSIGNALED(wstatus)) {
> +                int sig = WTERMSIG(wstatus);
> +                const char *signame = strsignal(sig) ?: "unknown ???";
> +
> +                if (!WCOREDUMP(wstatus)) {
> +                    die = false;

Does WCOREDUMP(wstatus) depend on the user's ulimit -c?

What higher level kind of QEMU termination are you trying to detect
here?

> +                    fprintf(stderr, "%s:%d: kill_qemu() ignoring QEMU death "
> +                            "by signal %d (%s)\n",
> +                            __FILE__, __LINE__, sig, signame);
> +                } else {
> +                    fprintf(stderr, "%s:%d: kill_qemu() detected QEMU death "
> +                            "with core dump from signal %d (%s)\n",
> +                            __FILE__, __LINE__, sig, signame);
> +                }
> +            }
> +        }
> +        if (die) {
> +            abort();
>          }
>      }
>  }
Thomas Huth July 24, 2018, 6:44 a.m. UTC | #3
On 23.07.2018 21:35, Eric Blake wrote:
> In kill_qemu() we have an assert that checks that the QEMU process
> didn't dump core:
>             assert(!WCOREDUMP(wstatus));
> 
> Unfortunately the WCOREDUMP macro here means the resulting message
> is not very easy to comprehend on at least some systems:
> 
> ahci-test: tests/libqtest.c:113: kill_qemu: Assertion `!(((__extension__ (((union { __typeof(wstatus) __in; int __i; }) { .__in = (wstatus) }).__i))) & 0x80)' failed.
> 
> and it doesn't identify what signal the process took.
> 
> Furthermore, we are NOT detecting EINTR (while EINTR shouldn't be
> happening if we didn't install signal handlers, it's still better
> to always be robust), and also want to log unexpected non-zero status
> that was not accompanied by a core dump.
> 
> Instead of using a raw assert, print the information in an
> easier to understand way:
> 
> /i386/ahci/sanity: tests/libqtest.c:119: kill_qemu() detected QEMU death with core dump from signal 11 (Segmentation fault)
> Aborted (core dumped)
> 
> (Of course, the really useful information would be why the QEMU
> process dumped core in the first place, but we don't have that
> by the time the test program has picked up the exit status.)
> 
> Suggested-by: Peter Maydell <peter.maydell@linaro.org>
> Signed-off-by: Eric Blake <eblake@redhat.com>
> ---
> 
> I've taken the ideas from Peter's patch:
> https://lists.gnu.org/archive/html/qemu-devel/2018-07/msg04430.html
> as well as fixing a related issue brought up last time this was touched:
> https://lists.gnu.org/archive/html/qemu-devel/2018-05/msg05710.html
> 
>  tests/libqtest.c | 39 +++++++++++++++++++++++++++++++++++++--
>  1 file changed, 37 insertions(+), 2 deletions(-)
> 
> diff --git a/tests/libqtest.c b/tests/libqtest.c
> index 098af6aec44..f3dabfadd78 100644
> --- a/tests/libqtest.c
> +++ b/tests/libqtest.c
> @@ -105,12 +105,47 @@ static void kill_qemu(QTestState *s)
>      if (s->qemu_pid != -1) {
>          int wstatus = 0;
>          pid_t pid;
> +        bool die = false;
> 
>          kill(s->qemu_pid, SIGTERM);
> +    retry:
>          pid = waitpid(s->qemu_pid, &wstatus, 0);
> +        if (pid == -1 && errno == EINTR) {
> +            goto retry;
> +        }

    do {
        pid = waitpid(s->qemu_pid, &wstatus, 0);
    } while (pid == -1 && errno == EINTR);

?

Or use the TFR macro from include/qemu-common.h ?

 Thomas
Eric Blake July 24, 2018, 2:35 p.m. UTC | #4
On 07/24/2018 01:36 AM, Markus Armbruster wrote:
> Eric Blake <eblake@redhat.com> writes:
> 
>> In kill_qemu() we have an assert that checks that the QEMU process
>> didn't dump core:
>>              assert(!WCOREDUMP(wstatus));
>>
>> Unfortunately the WCOREDUMP macro here means the resulting message
>> is not very easy to comprehend on at least some systems:
>>

>> -        if (pid == s->qemu_pid && WIFSIGNALED(wstatus)) {
>> -            assert(!WCOREDUMP(wstatus));

>> +            } else if (WIFSIGNALED(wstatus)) {
>> +                int sig = WTERMSIG(wstatus);
>> +                const char *signame = strsignal(sig) ?: "unknown ???";
>> +
>> +                if (!WCOREDUMP(wstatus)) {
>> +                    die = false;
> 
> Does WCOREDUMP(wstatus) depend on the user's ulimit -c?

'man waitpid' on Linux mentions that WCOREDUMP is nonportable, but does 
not mention any interaction with setrlimit.  But a quick test shows:

$ ulimit -S -c 0
$ cat foo.c
int main(int argc, char **argv) {
   return *argv[1];
}
$ gcc -o foo -Wall foo.c
$ ./foo 1
$ ./foo
Segmentation fault (core dumped)
$

the output was produced by bash, which uses waitpid() - and therefore 
the fact that bash reports the core dump even when no core file is 
created is promising.
Markus Armbruster July 25, 2018, 4:17 p.m. UTC | #5
Eric Blake <eblake@redhat.com> writes:

> On 07/24/2018 01:36 AM, Markus Armbruster wrote:
>> Eric Blake <eblake@redhat.com> writes:
>>
>>> In kill_qemu() we have an assert that checks that the QEMU process
>>> didn't dump core:
>>>              assert(!WCOREDUMP(wstatus));
>>>
>>> Unfortunately the WCOREDUMP macro here means the resulting message
>>> is not very easy to comprehend on at least some systems:
>>>
>
>>> -        if (pid == s->qemu_pid && WIFSIGNALED(wstatus)) {
>>> -            assert(!WCOREDUMP(wstatus));
>
>>> +            } else if (WIFSIGNALED(wstatus)) {
>>> +                int sig = WTERMSIG(wstatus);
>>> +                const char *signame = strsignal(sig) ?: "unknown ???";
>>> +
>>> +                if (!WCOREDUMP(wstatus)) {
>>> +                    die = false;
>>
>> Does WCOREDUMP(wstatus) depend on the user's ulimit -c?
>
> 'man waitpid' on Linux mentions that WCOREDUMP is nonportable, but
> does not mention any interaction with setrlimit.  But a quick test
> shows:
>
> $ ulimit -S -c 0
> $ cat foo.c
> int main(int argc, char **argv) {
>   return *argv[1];
> }
> $ gcc -o foo -Wall foo.c
> $ ./foo 1
> $ ./foo
> Segmentation fault (core dumped)
> $
>
> the output was produced by bash, which uses waitpid() - and therefore
> the fact that bash reports the core dump even when no core file is
> created is promising.

Proof beats plausibility argument:

$ cat wcordump.c
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int
main(void)
{
    pid_t child = fork();
    int wstatus;

    if (child < 0) {
	perror("fork");
	exit(1);
    }
    if (!child) {
	abort();
    }
    if (waitpid(child, &wstatus, 0) < 0) {
	perror("waitpid");
	exit(1);
    }
    if (WIFSIGNALED(wstatus)) {
	printf("sig %d %d\n",
	       WTERMSIG(wstatus), WCOREDUMP(wstatus));
    } else {
	printf("no sig\n");
    }
    exit(0);
}
$ gcc -Wall -g -O wcordump.c
$ (ulimit -c unlimited; ./a.out)
sig 6 128
$ (ulimit -c 0; ./a.out)
sig 6 0

Looks like WCOREDUMP() does depend on my ulimit -c.
Eric Blake July 30, 2018, 9:24 p.m. UTC | #6
On 07/25/2018 11:17 AM, Markus Armbruster wrote:

>>
>> the output was produced by bash, which uses waitpid() - and therefore
>> the fact that bash reports the core dump even when no core file is
>> created is promising.
> 
> Proof beats plausibility argument:
> 
> $ cat wcordump.c

> $ gcc -Wall -g -O wcordump.c
> $ (ulimit -c unlimited; ./a.out)
> sig 6 128
> $ (ulimit -c 0; ./a.out)
> sig 6 0

Doesn't match my results:

$ (ulimit -c 0; ./a.out)
sig 6 128

So, what's different between our two environments?

kernel 4.17.7-100.fc27.x86_64

$ echo /proc/sys/kernel/core_*
/proc/sys/kernel/core_pattern /proc/sys/kernel/core_pipe_limit 
/proc/sys/kernel/core_uses_pid
$ cat /proc/sys/kernel/core_*
|/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %e
0
1

> 
> Looks like WCOREDUMP() does depend on my ulimit -c.

Or worse, that its behavior is kernel/environment-sensitive.
Eric Blake July 30, 2018, 9:26 p.m. UTC | #7
On 07/24/2018 01:44 AM, Thomas Huth wrote:

>>
>> Furthermore, we are NOT detecting EINTR (while EINTR shouldn't be
>> happening if we didn't install signal handlers, it's still better
>> to always be robust), and also want to log unexpected non-zero status
>> that was not accompanied by a core dump.
>>

>>           kill(s->qemu_pid, SIGTERM);
>> +    retry:
>>           pid = waitpid(s->qemu_pid, &wstatus, 0);
>> +        if (pid == -1 && errno == EINTR) {
>> +            goto retry;
>> +        }
> 
>      do {
>          pid = waitpid(s->qemu_pid, &wstatus, 0);
>      } while (pid == -1 && errno == EINTR);
> 
> ?
> 
> Or use the TFR macro from include/qemu-common.h ?

Cool, I didn't know that macro existed! Will send v3.
Markus Armbruster July 31, 2018, 5:35 a.m. UTC | #8
Eric Blake <eblake@redhat.com> writes:

> On 07/25/2018 11:17 AM, Markus Armbruster wrote:
>
>>>
>>> the output was produced by bash, which uses waitpid() - and therefore
>>> the fact that bash reports the core dump even when no core file is
>>> created is promising.
>>
>> Proof beats plausibility argument:
>>
>> $ cat wcordump.c
>
>> $ gcc -Wall -g -O wcordump.c
>> $ (ulimit -c unlimited; ./a.out)
>> sig 6 128
>> $ (ulimit -c 0; ./a.out)
>> sig 6 0
>
> Doesn't match my results:
>
> $ (ulimit -c 0; ./a.out)
> sig 6 128
>
> So, what's different between our two environments?
>
> kernel 4.17.7-100.fc27.x86_64

4.17.7-200.fc28.x86_64

> $ echo /proc/sys/kernel/core_*
> /proc/sys/kernel/core_pattern /proc/sys/kernel/core_pipe_limit
> /proc/sys/kernel/core_uses_pid
> $ cat /proc/sys/kernel/core_*
> |/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %e

This is "core" on my development boxes (I'm a happy caveman).

# cat /etc/sysctl.d/50-coredump.conf 
kernel.core_pattern=core

> 0
> 1
>
>>
>> Looks like WCOREDUMP() does depend on my ulimit -c.
>
> Or worse, that its behavior is kernel/environment-sensitive.

Looks like it.
diff mbox series

Patch

diff --git a/tests/libqtest.c b/tests/libqtest.c
index 098af6aec44..f3dabfadd78 100644
--- a/tests/libqtest.c
+++ b/tests/libqtest.c
@@ -105,12 +105,47 @@  static void kill_qemu(QTestState *s)
     if (s->qemu_pid != -1) {
         int wstatus = 0;
         pid_t pid;
+        bool die = false;

         kill(s->qemu_pid, SIGTERM);
+    retry:
         pid = waitpid(s->qemu_pid, &wstatus, 0);
+        if (pid == -1 && errno == EINTR) {
+            goto retry;
+        }

-        if (pid == s->qemu_pid && WIFSIGNALED(wstatus)) {
-            assert(!WCOREDUMP(wstatus));
+        assert(pid == s->qemu_pid);
+        /*
+         * We expect qemu to exit with status 0; anything else is
+         * fishy and should be logged.  Abort except when death by
+         * signal is not accompanied by a coredump (as that's the only
+         * time it was likely that the user is trying to kill the
+         * testsuite early).
+         */
+        if (wstatus) {
+            die = true;
+            if (WIFEXITED(wstatus)) {
+                fprintf(stderr, "%s:%d: kill_qemu() tried to terminate QEMU "
+                        "process but encountered exit status %d\n",
+                        __FILE__, __LINE__, WEXITSTATUS(wstatus));
+            } else if (WIFSIGNALED(wstatus)) {
+                int sig = WTERMSIG(wstatus);
+                const char *signame = strsignal(sig) ?: "unknown ???";
+
+                if (!WCOREDUMP(wstatus)) {
+                    die = false;
+                    fprintf(stderr, "%s:%d: kill_qemu() ignoring QEMU death "
+                            "by signal %d (%s)\n",
+                            __FILE__, __LINE__, sig, signame);
+                } else {
+                    fprintf(stderr, "%s:%d: kill_qemu() detected QEMU death "
+                            "with core dump from signal %d (%s)\n",
+                            __FILE__, __LINE__, sig, signame);
+                }
+            }
+        }
+        if (die) {
+            abort();
         }
     }
 }