diff mbox series

iotests: Remove 030 from the auto group

Message ID 20200904055701.462482-1-thuth@redhat.com
State New
Series iotests: Remove 030 from the auto group

Commit Message

Thomas Huth Sept. 4, 2020, 5:57 a.m. UTC
Test 030 is still occasionally failing in the CI ... so for the
time being, let's disable it in the "auto" group. We can add it
back once it got more stable.

Signed-off-by: Thomas Huth <thuth@redhat.com>
---
 I just saw the problem here:
  https://cirrus-ci.com/task/5449330930745344?command=main#L6482
 and Peter hit it a couple of weeks ago:
  https://lists.gnu.org/archive/html/qemu-devel/2020-08/msg00136.html

 tests/qemu-iotests/group | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Kevin Wolf Sept. 4, 2020, 8:25 a.m. UTC | #1
On 04.09.2020 at 07:57, Thomas Huth wrote:
> Test 030 is still occasionally failing in the CI ... so for the
> time being, let's disable it in the "auto" group. We can add it
> back once it got more stable.
> 
> Signed-off-by: Thomas Huth <thuth@redhat.com>

I would rather just disable this one test function as 030 is a pretty
important one that tends to catch bugs.

>  I just saw the problem here:
>   https://cirrus-ci.com/task/5449330930745344?command=main#L6482
>  and Peter hit it a couple of weeks ago:
>   https://lists.gnu.org/archive/html/qemu-devel/2020-08/msg00136.html

I wonder how this can still happen. The test should have more than
enough time to complete now. Except if the throttling doesn't work as
expected.

I can't seem to reproduce this even if I add rather long delays. After
40 seconds, all jobs have moved either by 512k (which is STREAM_CHUNK)
or not at all.

What is interesting is that in both cases it's stream-node8, which is
the job streaming from node6 to node8, and node8 is the top-level node.
It's also the last job to be changed to full speed, so all others did
succeed before.

Kevin

Max Reitz Sept. 4, 2020, 8:31 a.m. UTC | #2
On 04.09.20 07:57, Thomas Huth wrote:
> Test 030 is still occasionally failing in the CI ... so for the
> time being, let's disable it in the "auto" group. We can add it
> back once it got more stable.
> 
> Signed-off-by: Thomas Huth <thuth@redhat.com>
> ---
>  I just saw the problem here:
>   https://cirrus-ci.com/task/5449330930745344?command=main#L6482
>  and Peter hit it a couple of weeks ago:
>   https://lists.gnu.org/archive/html/qemu-devel/2020-08/msg00136.html
> 
>  tests/qemu-iotests/group | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)

Thanks, applied to my block branch:

https://git.xanclic.moe/XanClic/qemu/commits/branch/block

Max Reitz Sept. 4, 2020, 8:31 a.m. UTC | #3
On 04.09.20 10:31, Max Reitz wrote:
> On 04.09.20 07:57, Thomas Huth wrote:
>> Test 030 is still occasionally failing in the CI ... so for the
>> time being, let's disable it in the "auto" group. We can add it
>> back once it got more stable.
>>
>> Signed-off-by: Thomas Huth <thuth@redhat.com>
>> ---
>>  I just saw the problem here:
>>   https://cirrus-ci.com/task/5449330930745344?command=main#L6482
>>  and Peter hit it a couple of weeks ago:
>>   https://lists.gnu.org/archive/html/qemu-devel/2020-08/msg00136.html
>>
>>  tests/qemu-iotests/group | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> Thanks, applied to my block branch:
> 
> https://git.xanclic.moe/XanClic/qemu/commits/branch/block

Or maybe not O:)

Thomas Huth Sept. 4, 2020, 10:14 a.m. UTC | #4
On 04/09/2020 10.25, Kevin Wolf wrote:
> On 04.09.2020 at 07:57, Thomas Huth wrote:
>> Test 030 is still occasionally failing in the CI ... so for the
>> time being, let's disable it in the "auto" group. We can add it
>> back once it got more stable.
>>
>> Signed-off-by: Thomas Huth <thuth@redhat.com>
> 
> I would rather just disable this one test function as 030 is a pretty
> important one that tends to catch bugs.

Ok, ... should it always get disabled, or shall we try to come up with
some magic checks so that it only gets disabled in the CI pipelines (...
though I don't have a clue how to check for Peter's merge test
environment...)?

 Thomas

Kevin Wolf Sept. 4, 2020, 10:37 a.m. UTC | #5
On 04.09.2020 at 12:14, Thomas Huth wrote:
> On 04/09/2020 10.25, Kevin Wolf wrote:
> > On 04.09.2020 at 07:57, Thomas Huth wrote:
> >> Test 030 is still occasionally failing in the CI ... so for the
> >> time being, let's disable it in the "auto" group. We can add it
> >> back once it got more stable.
> >>
> >> Signed-off-by: Thomas Huth <thuth@redhat.com>
> > 
> > I would rather just disable this one test function as 030 is a pretty
> > important one that tends to catch bugs.
> 
> Ok, ... should it always get disabled, or shall we try to come up with
> some magic checks so that it only gets disabled in the CI pipelines (...
> though I don't have a clue how to check for Peter's merge test
> environment...)?

Maybe we can detect whether we're run as part of the "auto" group and
skip the test then (as in QMPTestCase.case_skip)?

Kevin
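
A minimal sketch of the run-time skip idea, written against plain
`unittest` rather than the iotests helpers. The environment variable
`IOTESTS_GROUP`, the helper `running_in_auto_group()` and the test name
`test_flaky_case` are all hypothetical; how the harness would actually
expose the selected group is left open in this thread.

```python
import os
import unittest


def running_in_auto_group(environ=None):
    """Return True if the (hypothetical) IOTESTS_GROUP variable is 'auto'."""
    environ = os.environ if environ is None else environ
    return environ.get("IOTESTS_GROUP", "") == "auto"


class TestFlaky(unittest.TestCase):
    # Placeholder name; stands in for the one unstable test function.
    def test_flaky_case(self):
        # Skip at run time, in the spirit of QMPTestCase.case_skip,
        # so the rest of the test file still runs in the auto group.
        if running_in_auto_group():
            self.skipTest("unstable in CI; skipped in the auto group")
        # The real test body would go here.
        self.assertTrue(True)
```

Skipping a single function this way would keep the remaining cases of
the test file in the automated run, which is the point of the
suggestion above.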

Max Reitz Sept. 4, 2020, 10:38 a.m. UTC | #6
On 04.09.20 12:14, Thomas Huth wrote:
> On 04/09/2020 10.25, Kevin Wolf wrote:
>> On 04.09.2020 at 07:57, Thomas Huth wrote:
>>> Test 030 is still occasionally failing in the CI ... so for the
>>> time being, let's disable it in the "auto" group. We can add it
>>> back once it got more stable.
>>>
>>> Signed-off-by: Thomas Huth <thuth@redhat.com>
>>
>> I would rather just disable this one test function as 030 is a pretty
>> important one that tends to catch bugs.
> 
> Ok, ... should it always get disabled, or shall we try to come up with
> some magic checks so that it only gets disabled in the CI pipelines (...
> though I don't have a clue how to check for Peter's merge test
> environment...)?

I suppose we could let check-block.sh set some environment variable.

Max
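
A rough sketch of the environment-variable approach, in Python rather
than shell so it can be run standalone. The variable name
`IOTESTS_RUN_BY_CHECK_BLOCK` is a hypothetical placeholder and is not
taken from check-block.sh; the wrapper here only illustrates that a
variable set by the parent is visible to the test process it launches.

```python
import os
import subprocess
import sys

# Hypothetical marker variable; not taken from check-block.sh itself.
MARKER = "IOTESTS_RUN_BY_CHECK_BLOCK"


def run_with_marker(argv):
    """Run a child process with the marker variable set; return its stdout."""
    env = dict(os.environ, **{MARKER: "1"})
    done = subprocess.run(argv, env=env, capture_output=True,
                          text=True, check=True)
    return done.stdout.strip()


# Child side: a test only needs to look the variable up.
CHILD = f"import os; print(os.environ.get({MARKER!r}, 'unset'))"

if __name__ == "__main__":
    print(run_with_marker([sys.executable, "-c", CHILD]))
```

A test could then consult the variable and skip only the unstable
function when it is set, so manual runs would still exercise it.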

Thomas Huth Sept. 4, 2020, 11:51 a.m. UTC | #7
On 04/09/2020 12.38, Max Reitz wrote:
> On 04.09.20 12:14, Thomas Huth wrote:
>> On 04/09/2020 10.25, Kevin Wolf wrote:
>>> On 04.09.2020 at 07:57, Thomas Huth wrote:
>>>> Test 030 is still occasionally failing in the CI ... so for the
>>>> time being, let's disable it in the "auto" group. We can add it
>>>> back once it got more stable.
>>>>
>>>> Signed-off-by: Thomas Huth <thuth@redhat.com>
>>>
>>> I would rather just disable this one test function as 030 is a pretty
>>> important one that tends to catch bugs.
>>
>> Ok, ... should it always get disabled, or shall we try to come up with
>> some magic checks so that it only gets disabled in the CI pipelines (...
>> though I don't have a clue how to check for Peter's merge test
>> environment...)?
> 
> I suppose we could let check-block.sh set some environment variable.

Sounds like a plan! I'll try to cook a patch.

 Thomas

Alberto Garcia Sept. 23, 2020, 6:18 p.m. UTC | #8
On Fri 04 Sep 2020 10:25:13 AM CEST, Kevin Wolf wrote:
>> Test 030 is still occasionally failing in the CI ... so for the
>> time being, let's disable it in the "auto" group. We can add it
>> back once it got more stable.
>> 
>> Signed-off-by: Thomas Huth <thuth@redhat.com>
>
> I would rather just disable this one test function as 030 is a pretty
> important one that tends to catch bugs.
>
>>  I just saw the problem here:
>>   https://cirrus-ci.com/task/5449330930745344?command=main#L6482
>>  and Peter hit it a couple of weeks ago:
>>   https://lists.gnu.org/archive/html/qemu-devel/2020-08/msg00136.html
>
> I wonder how this can still happen. The test should have more than
> enough time to complete now. Except if the throttling doesn't work as
> expected.
>
> I can't seem to reproduce this even if I add rather long delays. After
> 40 seconds, all jobs have moved either by 512k (which is STREAM_CHUNK)
> or not at all.

I also don't understand how this can fail... I assume the test is not
running for that long in the cases when it fails, right?

Berto

Thomas Huth Sept. 24, 2020, 4:08 a.m. UTC | #9
On 23/09/2020 20.18, Alberto Garcia wrote:
> On Fri 04 Sep 2020 10:25:13 AM CEST, Kevin Wolf wrote:
>>> Test 030 is still occasionally failing in the CI ... so for the
>>> time being, let's disable it in the "auto" group. We can add it
>>> back once it got more stable.
>>>
>>> Signed-off-by: Thomas Huth <thuth@redhat.com>
>>
>> I would rather just disable this one test function as 030 is a pretty
>> important one that tends to catch bugs.
>>
>>>  I just saw the problem here:
>>>   https://cirrus-ci.com/task/5449330930745344?command=main#L6482
>>>  and Peter hit it a couple of weeks ago:
>>>   https://lists.gnu.org/archive/html/qemu-devel/2020-08/msg00136.html
>>
>> I wonder how this can still happen. The test should have more than
>> enough time to complete now. Except if the throttling doesn't work as
>> expected.
>>
>> I can't seem to reproduce this even if I add rather long delays. After
>> 40 seconds, all jobs have moved either by 512k (which is STREAM_CHUNK)
>> or not at all.
> 
> I also don't understand how this can fail... I assume the test is not
> running for that long in the cases when it fails, right?

Hard to say ... the problem only occurs occasionally, and I've never
seen it happen "live", only in the CI logs after the job has failed. I
guess you'd have to print timestamps in the code and then submit a lot
of jobs to the CI systems that are sensitive to this problem (e.g.
Cirrus and Travis) to find out...

 Thomas
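
One possible shape for the timestamp printing mentioned above, as a
small standalone helper. The name `log_ts()` is hypothetical and nothing
like it is claimed to exist in the iotests code; it only shows how
elapsed time could be attached to each message so that CI logs reveal
how long a failing run actually took.

```python
import sys
import time

# Reference point taken once, when the module is loaded.
_START = time.monotonic()


def log_ts(message, stream=None):
    """Write `message` prefixed with the seconds elapsed since load."""
    stream = sys.stderr if stream is None else stream
    elapsed = time.monotonic() - _START
    line = f"[{elapsed:9.3f}s] {message}"
    # Flush immediately so the line survives if the job is killed.
    print(line, file=stream, flush=True)
    return line
```

Calling `log_ts("job speed changed")` at the interesting points of a
test would leave a timeline in the captured CI output.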
Patch

diff --git a/tests/qemu-iotests/group b/tests/qemu-iotests/group
index 5cad015231..f084061a16 100644
--- a/tests/qemu-iotests/group
+++ b/tests/qemu-iotests/group
@@ -51,7 +51,7 @@ 
 027 rw auto quick
 028 rw backing quick
 029 rw auto quick
-030 rw auto backing
+030 rw backing
 031 rw auto quick
 032 rw auto quick
 033 rw auto quick