diff mbox series

[v2,2/7] iotests: exclude killed processes from running under Valgrind

Message ID 1560276131-683243-3-git-send-email-andrey.shinkevich@virtuozzo.com
State New
Headers show
Series Allow Valgrind checking all QEMU processes | expand

Commit Message

Andrey Shinkevich June 11, 2019, 6:02 p.m. UTC
The Valgrind tool fails to manage its termination when QEMU raises the
signal SIGKILL. Lets exclude such test cases from running under the
Valgrind because there is no sense to check memory issues that way.

Signed-off-by: Andrey Shinkevich <andrey.shinkevich@virtuozzo.com>
---
 tests/qemu-iotests/039 | 5 +++++
 tests/qemu-iotests/061 | 2 ++
 tests/qemu-iotests/137 | 1 +
 3 files changed, 8 insertions(+)

Comments

Vladimir Sementsov-Ogievskiy June 13, 2019, 9:47 a.m. UTC | #1
11.06.2019 21:02, Andrey Shinkevich wrote:
> The Valgrind tool fails to manage its termination when QEMU raises the
> signal SIGKILL. Lets exclude such test cases from running under the
> Valgrind because there is no sense to check memory issues that way.
> 
> Signed-off-by: Andrey Shinkevich <andrey.shinkevich@virtuozzo.com>
> ---
>   tests/qemu-iotests/039 | 5 +++++
>   tests/qemu-iotests/061 | 2 ++
>   tests/qemu-iotests/137 | 1 +
>   3 files changed, 8 insertions(+)
> 
> diff --git a/tests/qemu-iotests/039 b/tests/qemu-iotests/039
> index 0d4e963..95115e2 100755
> --- a/tests/qemu-iotests/039
> +++ b/tests/qemu-iotests/039
> @@ -65,6 +65,7 @@ echo "== Creating a dirty image file =="
>   IMGOPTS="compat=1.1,lazy_refcounts=on"
>   _make_test_img $size
>   
> +VALGRIND_QEMU="" \


Shouldn't it be written once per test, just without "\" ?

>   $QEMU_IO -c "write -P 0x5a 0 512" \
>            -c "sigraise $(kill -l KILL)" "$TEST_IMG" 2>&1 \
>       | _filter_qemu_io
> @@ -100,6 +101,7 @@ echo "== Opening a dirty image read/write should repair it =="
>   IMGOPTS="compat=1.1,lazy_refcounts=on"
>   _make_test_img $size
>   
> +VALGRIND_QEMU="" \
>   $QEMU_IO -c "write -P 0x5a 0 512" \
>            -c "sigraise $(kill -l KILL)" "$TEST_IMG" 2>&1 \
>       | _filter_qemu_io
> @@ -118,6 +120,7 @@ echo "== Creating an image file with lazy_refcounts=off =="
>   IMGOPTS="compat=1.1,lazy_refcounts=off"
>   _make_test_img $size
>   
> +VALGRIND_QEMU="" \
>   $QEMU_IO -c "write -P 0x5a 0 512" \
>            -c "sigraise $(kill -l KILL)" "$TEST_IMG" 2>&1 \
>       | _filter_qemu_io
> @@ -151,6 +154,7 @@ echo "== Changing lazy_refcounts setting at runtime =="
>   IMGOPTS="compat=1.1,lazy_refcounts=off"
>   _make_test_img $size
>   
> +VALGRIND_QEMU="" \
>   $QEMU_IO -c "reopen -o lazy-refcounts=on" \
>            -c "write -P 0x5a 0 512" \
>            -c "sigraise $(kill -l KILL)" "$TEST_IMG" 2>&1 \
> @@ -163,6 +167,7 @@ _check_test_img
>   IMGOPTS="compat=1.1,lazy_refcounts=on"
>   _make_test_img $size
>   
> +VALGRIND_QEMU="" \
>   $QEMU_IO -c "reopen -o lazy-refcounts=off" \
>            -c "write -P 0x5a 0 512" \
>            -c "sigraise $(kill -l KILL)" "$TEST_IMG" 2>&1 \
> diff --git a/tests/qemu-iotests/061 b/tests/qemu-iotests/061
> index d7dbd7e..5d0724c 100755
> --- a/tests/qemu-iotests/061
> +++ b/tests/qemu-iotests/061
> @@ -73,6 +73,7 @@ echo
>   echo "=== Testing dirty version downgrade ==="
>   echo
>   IMGOPTS="compat=1.1,lazy_refcounts=on" _make_test_img 64M
> +VALGRIND_QEMU="" \
>   $QEMU_IO -c "write -P 0x2a 0 128k" -c flush \
>            -c "sigraise $(kill -l KILL)" "$TEST_IMG" 2>&1 | _filter_qemu_io
>   $PYTHON qcow2.py "$TEST_IMG" dump-header
> @@ -107,6 +108,7 @@ echo
>   echo "=== Testing dirty lazy_refcounts=off ==="
>   echo
>   IMGOPTS="compat=1.1,lazy_refcounts=on" _make_test_img 64M
> +VALGRIND_QEMU="" \
>   $QEMU_IO -c "write -P 0x2a 0 128k" -c flush \
>            -c "sigraise $(kill -l KILL)" "$TEST_IMG" 2>&1 | _filter_qemu_io
>   $PYTHON qcow2.py "$TEST_IMG" dump-header
> diff --git a/tests/qemu-iotests/137 b/tests/qemu-iotests/137
> index 0c3d2a1..a442fc8 100755
> --- a/tests/qemu-iotests/137
> +++ b/tests/qemu-iotests/137
> @@ -130,6 +130,7 @@ echo
>   
>   # Whether lazy-refcounts was actually enabled can easily be tested: Check if
>   # the dirty bit is set after a crash
> +VALGRIND_QEMU="" \
>   $QEMU_IO \
>       -c "reopen -o lazy-refcounts=on,overlap-check=blubb" \
>       -c "write -P 0x5a 0 512" \
>
Kevin Wolf June 17, 2019, 11:15 a.m. UTC | #2
Am 11.06.2019 um 20:02 hat Andrey Shinkevich geschrieben:
> The Valgrind tool fails to manage its termination when QEMU raises the
> signal SIGKILL. Lets exclude such test cases from running under the
> Valgrind because there is no sense to check memory issues that way.
> 
> Signed-off-by: Andrey Shinkevich <andrey.shinkevich@virtuozzo.com>

I don't fully understand the reasoning here. Most interesting memory
access errors happen before a process terminates. (I'm not talking about
leaks here, but use-after-free, buffer overflows, uninitialised memory
etc.)

However, I do see that running these test cases with -valgrind ends in a
hang because the valgrind process keeps hanging around as a zombie
process and the test case doesn't reap it. I'm not exactly sure why that
is, but it looks more like a problem with the parent process (i.e. the
bash script).

If we can't figure out how to fix this, we can disable valgrind in these
cases, but I think the explanation needs to be different.

> diff --git a/tests/qemu-iotests/039 b/tests/qemu-iotests/039
> index 0d4e963..95115e2 100755
> --- a/tests/qemu-iotests/039
> +++ b/tests/qemu-iotests/039
> @@ -65,6 +65,7 @@ echo "== Creating a dirty image file =="
>  IMGOPTS="compat=1.1,lazy_refcounts=on"
>  _make_test_img $size
>  
> +VALGRIND_QEMU="" \
>  $QEMU_IO -c "write -P 0x5a 0 512" \
>           -c "sigraise $(kill -l KILL)" "$TEST_IMG" 2>&1 \
>      | _filter_qemu_io

I agree with Vladimir that setting VALGRIND_QEMU only once at the top of
the script is probably the better option.

Kevin
Roman Kagan June 17, 2019, 11:57 a.m. UTC | #3
On Thu, Jun 13, 2019 at 12:47:21PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> 11.06.2019 21:02, Andrey Shinkevich wrote:
> > The Valgrind tool fails to manage its termination when QEMU raises the
> > signal SIGKILL. Lets exclude such test cases from running under the
> > Valgrind because there is no sense to check memory issues that way.
> > 
> > Signed-off-by: Andrey Shinkevich <andrey.shinkevich@virtuozzo.com>
> > ---
> >   tests/qemu-iotests/039 | 5 +++++
> >   tests/qemu-iotests/061 | 2 ++
> >   tests/qemu-iotests/137 | 1 +
> >   3 files changed, 8 insertions(+)
> > 
> > diff --git a/tests/qemu-iotests/039 b/tests/qemu-iotests/039
> > index 0d4e963..95115e2 100755
> > --- a/tests/qemu-iotests/039
> > +++ b/tests/qemu-iotests/039
> > @@ -65,6 +65,7 @@ echo "== Creating a dirty image file =="
> >   IMGOPTS="compat=1.1,lazy_refcounts=on"
> >   _make_test_img $size
> >   
> > +VALGRIND_QEMU="" \
> 
> 
> Shouldn't it be written once per test, just without "\" ?

Only qemu-io invocations that perform raise(KILL) need to bypass
valgrinding.  Clearing VALGRIND_QEMU globally will indulge all qemu-io
throughout the test.

Roman.

> >   $QEMU_IO -c "write -P 0x5a 0 512" \
> >            -c "sigraise $(kill -l KILL)" "$TEST_IMG" 2>&1 \
> >       | _filter_qemu_io
> > @@ -100,6 +101,7 @@ echo "== Opening a dirty image read/write should repair it =="
> >   IMGOPTS="compat=1.1,lazy_refcounts=on"
> >   _make_test_img $size
> >   
> > +VALGRIND_QEMU="" \
> >   $QEMU_IO -c "write -P 0x5a 0 512" \
> >            -c "sigraise $(kill -l KILL)" "$TEST_IMG" 2>&1 \
> >       | _filter_qemu_io
> > @@ -118,6 +120,7 @@ echo "== Creating an image file with lazy_refcounts=off =="
> >   IMGOPTS="compat=1.1,lazy_refcounts=off"
> >   _make_test_img $size
> >   
> > +VALGRIND_QEMU="" \
> >   $QEMU_IO -c "write -P 0x5a 0 512" \
> >            -c "sigraise $(kill -l KILL)" "$TEST_IMG" 2>&1 \
> >       | _filter_qemu_io
> > @@ -151,6 +154,7 @@ echo "== Changing lazy_refcounts setting at runtime =="
> >   IMGOPTS="compat=1.1,lazy_refcounts=off"
> >   _make_test_img $size
> >   
> > +VALGRIND_QEMU="" \
> >   $QEMU_IO -c "reopen -o lazy-refcounts=on" \
> >            -c "write -P 0x5a 0 512" \
> >            -c "sigraise $(kill -l KILL)" "$TEST_IMG" 2>&1 \
> > @@ -163,6 +167,7 @@ _check_test_img
> >   IMGOPTS="compat=1.1,lazy_refcounts=on"
> >   _make_test_img $size
> >   
> > +VALGRIND_QEMU="" \
> >   $QEMU_IO -c "reopen -o lazy-refcounts=off" \
> >            -c "write -P 0x5a 0 512" \
> >            -c "sigraise $(kill -l KILL)" "$TEST_IMG" 2>&1 \
> > diff --git a/tests/qemu-iotests/061 b/tests/qemu-iotests/061
> > index d7dbd7e..5d0724c 100755
> > --- a/tests/qemu-iotests/061
> > +++ b/tests/qemu-iotests/061
> > @@ -73,6 +73,7 @@ echo
> >   echo "=== Testing dirty version downgrade ==="
> >   echo
> >   IMGOPTS="compat=1.1,lazy_refcounts=on" _make_test_img 64M
> > +VALGRIND_QEMU="" \
> >   $QEMU_IO -c "write -P 0x2a 0 128k" -c flush \
> >            -c "sigraise $(kill -l KILL)" "$TEST_IMG" 2>&1 | _filter_qemu_io
> >   $PYTHON qcow2.py "$TEST_IMG" dump-header
> > @@ -107,6 +108,7 @@ echo
> >   echo "=== Testing dirty lazy_refcounts=off ==="
> >   echo
> >   IMGOPTS="compat=1.1,lazy_refcounts=on" _make_test_img 64M
> > +VALGRIND_QEMU="" \
> >   $QEMU_IO -c "write -P 0x2a 0 128k" -c flush \
> >            -c "sigraise $(kill -l KILL)" "$TEST_IMG" 2>&1 | _filter_qemu_io
> >   $PYTHON qcow2.py "$TEST_IMG" dump-header
> > diff --git a/tests/qemu-iotests/137 b/tests/qemu-iotests/137
> > index 0c3d2a1..a442fc8 100755
> > --- a/tests/qemu-iotests/137
> > +++ b/tests/qemu-iotests/137
> > @@ -130,6 +130,7 @@ echo
> >   
> >   # Whether lazy-refcounts was actually enabled can easily be tested: Check if
> >   # the dirty bit is set after a crash
> > +VALGRIND_QEMU="" \
> >   $QEMU_IO \
> >       -c "reopen -o lazy-refcounts=on,overlap-check=blubb" \
> >       -c "write -P 0x5a 0 512" \
> > 
> 
> 
> -- 
> Best regards,
> Vladimir
Roman Kagan June 17, 2019, 12:18 p.m. UTC | #4
On Mon, Jun 17, 2019 at 01:15:04PM +0200, Kevin Wolf wrote:
> Am 11.06.2019 um 20:02 hat Andrey Shinkevich geschrieben:
> > The Valgrind tool fails to manage its termination when QEMU raises the
> > signal SIGKILL. Lets exclude such test cases from running under the
> > Valgrind because there is no sense to check memory issues that way.
> > 
> > Signed-off-by: Andrey Shinkevich <andrey.shinkevich@virtuozzo.com>
> 
> I don't fully understand the reasoning here. Most interesting memory
> access errors happen before a process terminates. (I'm not talking about
> leaks here, but use-after-free, buffer overflows, uninitialised memory
> etc.)

Nothing of the above, and nothing in general, happens in the usermode
process upon SIGKILL delivery.  

> However, I do see that running these test cases with -valgrind ends in a
> hang because the valgrind process keeps hanging around as a zombie
> process and the test case doesn't reap it. I'm not exactly sure why that
> is, but it looks more like a problem with the parent process (i.e. the
> bash script).

It rather looks like valgrind getting confused about what to do with
raise(SIGKILL) in the multithreaded case.

> If we can't figure out how to fix this, we can disable valgrind in these
> cases, but I think the explanation needs to be different.
> 
> > diff --git a/tests/qemu-iotests/039 b/tests/qemu-iotests/039
> > index 0d4e963..95115e2 100755
> > --- a/tests/qemu-iotests/039
> > +++ b/tests/qemu-iotests/039
> > @@ -65,6 +65,7 @@ echo "== Creating a dirty image file =="
> >  IMGOPTS="compat=1.1,lazy_refcounts=on"
> >  _make_test_img $size
> >  
> > +VALGRIND_QEMU="" \
> >  $QEMU_IO -c "write -P 0x5a 0 512" \
> >           -c "sigraise $(kill -l KILL)" "$TEST_IMG" 2>&1 \
> >      | _filter_qemu_io
> 
> I agree with Vladimir that setting VALGRIND_QEMU only once at the top of
> the script is probably the better option.

It is not, because there's no reason for qemu-io invocations that don't
perform raise(SIGKILL) to escape valgrinding.

Roman.
Kevin Wolf June 17, 2019, 12:53 p.m. UTC | #5
Am 17.06.2019 um 14:18 hat Roman Kagan geschrieben:
> On Mon, Jun 17, 2019 at 01:15:04PM +0200, Kevin Wolf wrote:
> > Am 11.06.2019 um 20:02 hat Andrey Shinkevich geschrieben:
> > > The Valgrind tool fails to manage its termination when QEMU raises the
> > > signal SIGKILL. Lets exclude such test cases from running under the
> > > Valgrind because there is no sense to check memory issues that way.
> > > 
> > > Signed-off-by: Andrey Shinkevich <andrey.shinkevich@virtuozzo.com>
> > 
> > I don't fully understand the reasoning here. Most interesting memory
> > access errors happen before a process terminates. (I'm not talking about
> > leaks here, but use-after-free, buffer overflows, uninitialised memory
> > etc.)
> 
> Nothing of the above, and nothing in general, happens in the usermode
> process upon SIGKILL delivery.

My point is, the interesting part is what the program does before
SIGKILL happens. There is value in reporting memory errors as long as we
can, even if the final check doesn't happen because of SIGKILL.

> > However, I do see that running these test cases with -valgrind ends in a
> > hang because the valgrind process keeps hanging around as a zombie
> > process and the test case doesn't reap it. I'm not exactly sure why that
> > is, but it looks more like a problem with the parent process (i.e. the
> > bash script).
> 
> It rather looks like valgrind getting confused about what to do with
> raise(SIGKILL) in the multithreaded case.

Well, valgrind can't do anything with SIGKILL, obviously, because it's
killed immediately. But maybe the kernel does get confused for some
reason. I get the main threads as a zombie, but a second is still
running. Sending SIGKILL to the second thread, too, makes the test case
complete successfully.

So I guess the main question is why the second thread isn't
automatically killed when the main thread receives SIGKILL.

Kevin
Roman Kagan June 17, 2019, 1:20 p.m. UTC | #6
On Mon, Jun 17, 2019 at 02:53:55PM +0200, Kevin Wolf wrote:
> Am 17.06.2019 um 14:18 hat Roman Kagan geschrieben:
> > On Mon, Jun 17, 2019 at 01:15:04PM +0200, Kevin Wolf wrote:
> > > Am 11.06.2019 um 20:02 hat Andrey Shinkevich geschrieben:
> > > > The Valgrind tool fails to manage its termination when QEMU raises the
> > > > signal SIGKILL. Lets exclude such test cases from running under the
> > > > Valgrind because there is no sense to check memory issues that way.
> > > > 
> > > > Signed-off-by: Andrey Shinkevich <andrey.shinkevich@virtuozzo.com>
> > > 
> > > I don't fully understand the reasoning here. Most interesting memory
> > > access errors happen before a process terminates. (I'm not talking about
> > > leaks here, but use-after-free, buffer overflows, uninitialised memory
> > > etc.)
> > 
> > Nothing of the above, and nothing in general, happens in the usermode
> > process upon SIGKILL delivery.
> 
> My point is, the interesting part is what the program does before
> SIGKILL happens. There is value in reporting memory errors as long as we
> can, even if the final check doesn't happen because of SIGKILL.

Agreed in general, but here the testcases that include 'sigraise 9' only
do simple operations before that which are covered elsewhere too.  So
the extra effort on making valgrind work with these testcases arguably
isn't worth the extra value to be gained.

> > > However, I do see that running these test cases with -valgrind ends in a
> > > hang because the valgrind process keeps hanging around as a zombie
> > > process and the test case doesn't reap it. I'm not exactly sure why that
> > > is, but it looks more like a problem with the parent process (i.e. the
> > > bash script).
> > 
> > It rather looks like valgrind getting confused about what to do with
> > raise(SIGKILL) in the multithreaded case.
> 
> Well, valgrind can't do anything with SIGKILL, obviously, because it's
> killed immediately.

Right, but it can do whatever it wants with raise(SIGKILL).  I haven't
looked at valgrind sources, but

  # strace -ff valgind qemu-io -c 'sigraise 9'

shows SIGKILL neither sent nor received by any thread; it just shows the
main thread exit and the second thread getting stuck waiting on a futex.

> But maybe the kernel does get confused for some
> reason. I get the main threads as a zombie, but a second is still
> running. Sending SIGKILL to the second thread, too, makes the test case
> complete successfully.
> 
> So I guess the main question is why the second thread isn't
> automatically killed when the main thread receives SIGKILL.

I don't see any thread receive SIGKILL.  So I tend to think this is
valgrind's bug/feature.

Anyway the problem is outside of QEMU, so I think we need to weigh the
costs of investigating it and implementing a workaround with the
potential benefit.

Roman.
Kevin Wolf June 17, 2019, 2:51 p.m. UTC | #7
Am 17.06.2019 um 15:20 hat Roman Kagan geschrieben:
> On Mon, Jun 17, 2019 at 02:53:55PM +0200, Kevin Wolf wrote:
> > Am 17.06.2019 um 14:18 hat Roman Kagan geschrieben:
> > > On Mon, Jun 17, 2019 at 01:15:04PM +0200, Kevin Wolf wrote:
> > > > Am 11.06.2019 um 20:02 hat Andrey Shinkevich geschrieben:
> > > > > The Valgrind tool fails to manage its termination when QEMU raises the
> > > > > signal SIGKILL. Lets exclude such test cases from running under the
> > > > > Valgrind because there is no sense to check memory issues that way.
> > > > > 
> > > > > Signed-off-by: Andrey Shinkevich <andrey.shinkevich@virtuozzo.com>
> > > > 
> > > > I don't fully understand the reasoning here. Most interesting memory
> > > > access errors happen before a process terminates. (I'm not talking about
> > > > leaks here, but use-after-free, buffer overflows, uninitialised memory
> > > > etc.)
> > > 
> > > Nothing of the above, and nothing in general, happens in the usermode
> > > process upon SIGKILL delivery.
> > 
> > My point is, the interesting part is what the program does before
> > SIGKILL happens. There is value in reporting memory errors as long as we
> > can, even if the final check doesn't happen because of SIGKILL.
> 
> Agreed in general, but here the testcases that include 'sigraise 9' only
> do simple operations before that which are covered elsewhere too.  So
> the extra effort on making valgrind work with these testcases arguably
> isn't worth the extra value to be gained.

Ok, fair enough.

> > > > However, I do see that running these test cases with -valgrind ends in a
> > > > hang because the valgrind process keeps hanging around as a zombie
> > > > process and the test case doesn't reap it. I'm not exactly sure why that
> > > > is, but it looks more like a problem with the parent process (i.e. the
> > > > bash script).
> > > 
> > > It rather looks like valgrind getting confused about what to do with
> > > raise(SIGKILL) in the multithreaded case.
> > 
> > Well, valgrind can't do anything with SIGKILL, obviously, because it's
> > killed immediately.
> 
> Right, but it can do whatever it wants with raise(SIGKILL).  I haven't
> looked at valgrind sources, but
> 
>   # strace -ff valgind qemu-io -c 'sigraise 9'
> 
> shows SIGKILL neither sent nor received by any thread; it just shows the
> main thread exit and the second thread getting stuck waiting on a futex.

Oh, I didn't see this! So there isn't even a real SIGKILL signal.

> > But maybe the kernel does get confused for some
> > reason. I get the main threads as a zombie, but a second is still
> > running. Sending SIGKILL to the second thread, too, makes the test case
> > complete successfully.
> > 
> > So I guess the main question is why the second thread isn't
> > automatically killed when the main thread receives SIGKILL.
> 
> I don't see any thread receive SIGKILL.  So I tend to think this is
> valgrind's bug/feature.
> 
> Anyway the problem is outside of QEMU, so I think we need to weigh the
> costs of investigating it and implementing a workaround with the
> potential benefit.

I'd suggest to file a bug against valgrind at least. And indeed just
disable valgrind here like this patch does.

Kevin
Andrey Shinkevich June 24, 2019, 4:55 p.m. UTC | #8
On 17/06/2019 17:51, Kevin Wolf wrote:
> Am 17.06.2019 um 15:20 hat Roman Kagan geschrieben:
>> On Mon, Jun 17, 2019 at 02:53:55PM +0200, Kevin Wolf wrote:
>>> Am 17.06.2019 um 14:18 hat Roman Kagan geschrieben:
>>>> On Mon, Jun 17, 2019 at 01:15:04PM +0200, Kevin Wolf wrote:
>>>>> Am 11.06.2019 um 20:02 hat Andrey Shinkevich geschrieben:
>>>>>> The Valgrind tool fails to manage its termination when QEMU raises the
>>>>>> signal SIGKILL. Lets exclude such test cases from running under the
>>>>>> Valgrind because there is no sense to check memory issues that way.
>>>>>>
>>>>>> Signed-off-by: Andrey Shinkevich <andrey.shinkevich@virtuozzo.com>
>>>>>
>>>>> I don't fully understand the reasoning here. Most interesting memory
>>>>> access errors happen before a process terminates. (I'm not talking about
>>>>> leaks here, but use-after-free, buffer overflows, uninitialised memory
>>>>> etc.)
>>>>
>>>> Nothing of the above, and nothing in general, happens in the usermode
>>>> process upon SIGKILL delivery.
>>>
>>> My point is, the interesting part is what the program does before
>>> SIGKILL happens. There is value in reporting memory errors as long as we
>>> can, even if the final check doesn't happen because of SIGKILL.
>>
>> Agreed in general, but here the testcases that include 'sigraise 9' only
>> do simple operations before that which are covered elsewhere too.  So
>> the extra effort on making valgrind work with these testcases arguably
>> isn't worth the extra value to be gained.
> 
> Ok, fair enough.
> 
>>>>> However, I do see that running these test cases with -valgrind ends in a
>>>>> hang because the valgrind process keeps hanging around as a zombie
>>>>> process and the test case doesn't reap it. I'm not exactly sure why that
>>>>> is, but it looks more like a problem with the parent process (i.e. the
>>>>> bash script).
>>>>
>>>> It rather looks like valgrind getting confused about what to do with
>>>> raise(SIGKILL) in the multithreaded case.
>>>
>>> Well, valgrind can't do anything with SIGKILL, obviously, because it's
>>> killed immediately.
>>
>> Right, but it can do whatever it wants with raise(SIGKILL).  I haven't
>> looked at valgrind sources, but
>>
>>    # strace -ff valgind qemu-io -c 'sigraise 9'
>>
>> shows SIGKILL neither sent nor received by any thread; it just shows the
>> main thread exit and the second thread getting stuck waiting on a futex.
> 
> Oh, I didn't see this! So there isn't even a real SIGKILL signal.
> 
>>> But maybe the kernel does get confused for some
>>> reason. I get the main threads as a zombie, but a second is still
>>> running. Sending SIGKILL to the second thread, too, makes the test case
>>> complete successfully.
>>>
>>> So I guess the main question is why the second thread isn't
>>> automatically killed when the main thread receives SIGKILL.
>>
>> I don't see any thread receive SIGKILL.  So I tend to think this is
>> valgrind's bug/feature.
>>
>> Anyway the problem is outside of QEMU, so I think we need to weigh the
>> costs of investigating it and implementing a workaround with the
>> potential benefit.
> 
> I'd suggest to file a bug against valgrind at least. And indeed just
> disable valgrind here like this patch does.
> 
> Kevin
> 

I have reported the issue to the KDE Bugtracking System on bugs.kde.org
as instructed on the www.valgrind.org/support/bug_reports.html

The bug 409141 "Valgrind hangs when SIGKILLed" has been created.
The thread can be seen on https://bugs.kde.org/show_bug.cgi?id=409141

Andrey
diff mbox series

Patch

diff --git a/tests/qemu-iotests/039 b/tests/qemu-iotests/039
index 0d4e963..95115e2 100755
--- a/tests/qemu-iotests/039
+++ b/tests/qemu-iotests/039
@@ -65,6 +65,7 @@  echo "== Creating a dirty image file =="
 IMGOPTS="compat=1.1,lazy_refcounts=on"
 _make_test_img $size
 
+VALGRIND_QEMU="" \
 $QEMU_IO -c "write -P 0x5a 0 512" \
          -c "sigraise $(kill -l KILL)" "$TEST_IMG" 2>&1 \
     | _filter_qemu_io
@@ -100,6 +101,7 @@  echo "== Opening a dirty image read/write should repair it =="
 IMGOPTS="compat=1.1,lazy_refcounts=on"
 _make_test_img $size
 
+VALGRIND_QEMU="" \
 $QEMU_IO -c "write -P 0x5a 0 512" \
          -c "sigraise $(kill -l KILL)" "$TEST_IMG" 2>&1 \
     | _filter_qemu_io
@@ -118,6 +120,7 @@  echo "== Creating an image file with lazy_refcounts=off =="
 IMGOPTS="compat=1.1,lazy_refcounts=off"
 _make_test_img $size
 
+VALGRIND_QEMU="" \
 $QEMU_IO -c "write -P 0x5a 0 512" \
          -c "sigraise $(kill -l KILL)" "$TEST_IMG" 2>&1 \
     | _filter_qemu_io
@@ -151,6 +154,7 @@  echo "== Changing lazy_refcounts setting at runtime =="
 IMGOPTS="compat=1.1,lazy_refcounts=off"
 _make_test_img $size
 
+VALGRIND_QEMU="" \
 $QEMU_IO -c "reopen -o lazy-refcounts=on" \
          -c "write -P 0x5a 0 512" \
          -c "sigraise $(kill -l KILL)" "$TEST_IMG" 2>&1 \
@@ -163,6 +167,7 @@  _check_test_img
 IMGOPTS="compat=1.1,lazy_refcounts=on"
 _make_test_img $size
 
+VALGRIND_QEMU="" \
 $QEMU_IO -c "reopen -o lazy-refcounts=off" \
          -c "write -P 0x5a 0 512" \
          -c "sigraise $(kill -l KILL)" "$TEST_IMG" 2>&1 \
diff --git a/tests/qemu-iotests/061 b/tests/qemu-iotests/061
index d7dbd7e..5d0724c 100755
--- a/tests/qemu-iotests/061
+++ b/tests/qemu-iotests/061
@@ -73,6 +73,7 @@  echo
 echo "=== Testing dirty version downgrade ==="
 echo
 IMGOPTS="compat=1.1,lazy_refcounts=on" _make_test_img 64M
+VALGRIND_QEMU="" \
 $QEMU_IO -c "write -P 0x2a 0 128k" -c flush \
          -c "sigraise $(kill -l KILL)" "$TEST_IMG" 2>&1 | _filter_qemu_io
 $PYTHON qcow2.py "$TEST_IMG" dump-header
@@ -107,6 +108,7 @@  echo
 echo "=== Testing dirty lazy_refcounts=off ==="
 echo
 IMGOPTS="compat=1.1,lazy_refcounts=on" _make_test_img 64M
+VALGRIND_QEMU="" \
 $QEMU_IO -c "write -P 0x2a 0 128k" -c flush \
          -c "sigraise $(kill -l KILL)" "$TEST_IMG" 2>&1 | _filter_qemu_io
 $PYTHON qcow2.py "$TEST_IMG" dump-header
diff --git a/tests/qemu-iotests/137 b/tests/qemu-iotests/137
index 0c3d2a1..a442fc8 100755
--- a/tests/qemu-iotests/137
+++ b/tests/qemu-iotests/137
@@ -130,6 +130,7 @@  echo
 
 # Whether lazy-refcounts was actually enabled can easily be tested: Check if
 # the dirty bit is set after a crash
+VALGRIND_QEMU="" \
 $QEMU_IO \
     -c "reopen -o lazy-refcounts=on,overlap-check=blubb" \
     -c "write -P 0x5a 0 512" \