diff mbox

qtest hang in /x86_64/ahci/io/ncq/simple (ppc64 host)

Message ID 559AD528.5030903@redhat.com
State New
Headers show

Commit Message

John Snow July 6, 2015, 7:21 p.m. UTC
On 07/06/2015 12:24 PM, Paolo Bonzini wrote:
> 
> 
> On 06/07/2015 17:41, John Snow wrote:
>> I'll take a look. More than possible there's a race in the wait
>> conditional that I have just never seen before.
>>
>> I tweaked it a little recently to wait on both NCQ and traditional
>> completion, but it doesn't have a timeout or anything. I'll go ahead and
>> fix that while I'm here.
>>
>> Should be something simple, hopefully.

The one Peter Maydell reported is indeed something simple. Sent out a patch.

> 
> My hang in test_halted_ncq is different:

... Great :)

> 
> (gdb) up
> #1  0x00007f3aaadb4476 in read (__nbytes=1, __buf=0x7ffc0652abbf,
> __fd=<optimized out>) at /usr/include/bits/unistd.h:44
> 44	  return __read_alias (__fd, __buf, __nbytes);
> (gdb)
> #2  qtest_qmp_receive (s=s@entry=0x7f3aac017360) at
> /home/work/upstream/qemu/tests/libqtest.c:371
> 371	        len = read(s->qmp_fd, &c, 1);
> (gdb)
> #3  0x00007f3aaadb4c48 in qtest_qmp_eventwait (s=0x7f3aac017360,
> event=event@entry=0x7f3aaadbdc68 "STOP") at
> /home/work/upstream/qemu/tests/libqtest.c:477
> 477	        response = qtest_qmp_receive(s);
> (gdb)
> #4  0x00007f3aaadb3514 in qmp_eventwait (event=0x7f3aaadbdc68 "STOP") at
> /home/work/upstream/qemu/tests/libqtest.h:498
> 498	    return qtest_qmp_eventwait(global_qtest, event);
> (gdb)
> #5  ahci_guest_io_halt (ahci=ahci@entry=0x7f3aac014820,
> port=port@entry=5 '\005', ide_cmd=ide_cmd@entry=97 'a',
> buffer=buffer@entry=1097728, bufsize=bufsize@entry=4096,
> sector=sector@entry=0)
>     at /home/work/upstream/qemu/tests/libqos/ahci.c:604
> 604	    qmp_eventwait("STOP");
> (gdb)
> #6  0x00007f3aaada9e09 in ahci_halted_io_test (cmd_read=<optimized out>,
> cmd_write=<optimized out>) at
> /home/work/upstream/qemu/tests/ahci-test.c:1244
> 1244	    cmd = ahci_guest_io_halt(ahci, port, cmd_write,
> 
> 
> In QEMU, current_run_state is RUN_STATE_RUNNING, so the halt didn't
> happen.  It's 100% reproducible with "make check-qtest-{i386,x86_64} V=1
> -j12".  This is the final part of the qtest log:
> 
> [R +0.066539] readl 0xe0000390
> [S +0.066553] OK 0x0000000000000000
> [R +0.066582] memset 0x10b000 0x100 0x00
> [S +0.066597] OK
> [R +0.067627] write 0x10c000 0x1000
[snip]
> [S +0.067748] OK
> [R +0.067782] readl 0xe00003b8
> [S +0.067796] OK 0x0000000000000000
> [R +0.067824] read 0x10a020 0x20
> [S +0.067843] OK
> 0x0000000000000000000000000000000000000000000000000000000000000000
> [R +0.067874] write 0x10a020 0x20
> 0x0000000000000000000000000000000000000000000000000000000000000000
> [S +0.067897] OK
> [R +0.067926] write 0x10a020 0x20
> 0x450401000000000000e010000000000000000000000000000000000000000000
> [S +0.067940] OK
> [R +0.067964] write 0x10e000 0x14 0x2780610800000040000000000800000000000000
> [S +0.067980] OK
> [R +0.068004] write 0x10e080 0x10 0x00c010000000000000000000ff0f0080
> [S +0.068018] OK
> [R +0.068042] writel 0xe00003b4 0x2
> [S +0.068056] OK
> [R +0.068079] writel 0xe00003b8 0x2
> [S +0.069500] OK
> 
> This is the /tmp/qtest-blkdebug.INHp7s file:
> 
> [inject-error]
> event = "write_aio"
> errno = "5"
> state = "1"
> immediately = "off"
> once = "on"
> [set-state]
> event = "write_aio"
> new_state = "2"
> 
> There is no trace of the write payload in the .qcow2 file.

Smells like the injected error prevented that successfully, but...

> 
> Paolo
> 

Can't reproduce: Fedora 21, 4.0.5, Today's QEMU HEAD (7edd8e56) ...
So it's a race condition of some sort.

Obvious questions: You've tried a distclean?
What was your configure invocation?

My attempt:
make distclean
../../configure --target-list="x86_64-softmmu i386-softmmu" --enable-debug
make -j9
make check-qtest-{i386,x86_64} V=1 -j12

Completes just fine for me.

That aside, I can't figure out why the machine HALT *and* the write
wouldn't occur. Can you stick a bunch of printfs inside the NCQ error
handling function and see if there's something goofing up in there, or
if it's even making it there?



My guess is I have goofed something up above there in the error handling
section of ncq_cb, but "worksforme," so I am not sure what's going on.

If no prints show up during testing, maybe turn on AHCI_DEBUG as well
and see if any prints are showing up in the success case.

--js
diff mbox

Patch

diff --git a/hw/ide/ahci.c b/hw/ide/ahci.c
index bb6a92f..f1af6f9 100644
--- a/hw/ide/ahci.c
+++ b/hw/ide/ahci.c
@@ -921,6 +921,7 @@  static void ncq_cb(void *opaque, int ret)
     IDEState *ide_state = &ncq_tfs->drive->port.ifs[0];

     if (ret == -ECANCELED) {
+        fprintf(stderr, "ncq_cb: -ECANCELED\n");
         return;
     }

@@ -928,12 +929,17 @@  static void ncq_cb(void *opaque, int ret)
         bool is_read = ncq_tfs->cmd == READ_FPDMA_QUEUED;
         BlockErrorAction action = blk_get_error_action(ide_state->blk,
                                                        is_read, -ret);
+        fprintf(stderr, "ncq_cb: ret (%d) is < 0; action is %s\n",
+                ret, BlockErrorAction_lookup[action]);
         if (action == BLOCK_ERROR_ACTION_STOP) {
             ncq_tfs->halt = true;
             ide_state->bus->error_status = IDE_RETRY_HBA;
+            fprintf(stderr, "ncq_cb: BLOCK_ERROR_ACTION_STOP\n");
         } else if (action == BLOCK_ERROR_ACTION_REPORT) {
             ncq_err(ncq_tfs);
+            fprintf(stderr, "ncq_cb: BLOCK_ERROR_ACTION_REPORT\n");
         }
+        fprintf(stderr, "ncq_cb: -> blk_error_action\n");
         blk_error_action(ide_state->blk, action, is_read, -ret);
     } else {
         ide_state->status = READY_STAT | SEEK_STAT;