From patchwork Fri May 24 21:23:02 2013 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: Laszlo Ersek X-Patchwork-Id: 246288 Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@bilbo.ozlabs.org Received: from lists.gnu.org (lists.gnu.org [208.118.235.17]) (using TLSv1 with cipher AES256-SHA (256/256 bits)) (Client did not present a certificate) by ozlabs.org (Postfix) with ESMTPS id DD3AC2C00AA for ; Sat, 25 May 2013 07:21:27 +1000 (EST) Received: from localhost ([::1]:55563 helo=lists.gnu.org) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1UfzQT-0007oS-Tk for incoming@patchwork.ozlabs.org; Fri, 24 May 2013 17:21:25 -0400 Received: from eggs.gnu.org ([208.118.235.92]:56041) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1UfzPy-0007ix-Q6 for qemu-devel@nongnu.org; Fri, 24 May 2013 17:21:02 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1UfzPq-0007mz-T3 for qemu-devel@nongnu.org; Fri, 24 May 2013 17:20:54 -0400 Received: from mx1.redhat.com ([209.132.183.28]:31009) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1UfzPq-0007mt-Ic for qemu-devel@nongnu.org; Fri, 24 May 2013 17:20:46 -0400 Received: from int-mx02.intmail.prod.int.phx2.redhat.com (int-mx02.intmail.prod.int.phx2.redhat.com [10.5.11.12]) by mx1.redhat.com (8.14.4/8.14.4) with ESMTP id r4OLKgxL009799 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=OK); Fri, 24 May 2013 17:20:43 -0400 Received: from lacos-laptop.usersys.redhat.com (vpn1-6-105.ams2.redhat.com [10.36.6.105]) by int-mx02.intmail.prod.int.phx2.redhat.com (8.13.8/8.13.8) with ESMTP id r4OLKeWE004605; Fri, 24 May 2013 17:20:40 -0400 Message-ID: <519FDA36.1000202@redhat.com> Date: Fri, 24 May 2013 23:23:02 +0200 From: Laszlo Ersek User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20130513 Thunderbird/17.0.6 MIME-Version: 1.0 To: Duane Voth References: In-Reply-To: X-Scanned-By: MIMEDefang 2.67 on 10.5.11.12 X-MIME-Autoconverted: from 8bit to quoted-printable by mx1.redhat.com id r4OLKgxL009799 X-detected-operating-system: by eggs.gnu.org: GNU/Linux 3.x X-Received-From: 209.132.183.28 Cc: Luiz Capitulino , Bug 1180970 <1180970@bugs.launchpad.net>, qemu-devel@nongnu.org, Richard Henderson Subject: Re: [Qemu-devel] [Bug 1180970] *** affects all x86_64 soft emulation X-BeenThere: qemu-devel@nongnu.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: qemu-devel-bounces+incoming=patchwork.ozlabs.org@nongnu.org Sender: qemu-devel-bounces+incoming=patchwork.ozlabs.org@nongnu.org On 05/24/13 19:25, Duane Voth wrote: > qemu: fatal: Trying to execute code outside RAM or ROM; worked in > 1.4.0, fails in 1.4.92 > > Want to bring a little attention to this bug - the break is in > target-i386/translate.c which affects all x86_64 soft emulation in a > fairly subtle way (ie. users will report a wide variety of problems > none of which seem to be related). I can't find a way to elevate bug > importance in launchpad. > > 4a6fd938f5457ee161d2acbd9364608a2a68b7a1 is the offending commit. > There have been numerous changes after this commit over top of the > change that broke emulation, so backing out this commit is not > trivial. > > I can reproduce the problem that is the subject of bug 1180970 for > testing easily. I can also reproduce this bug with my OVMF build, when KVM is disabled (current master). x86_64-softmmu/qemu-system-x86_64 -S -monitor stdio -m 1024 \ -vga cirrus -debugcon file:debug.log \ -global isa-debugcon.iobase=0x402 \ -bios /home/lacos/src/upstream/edk2-git-svn/out/OVMF.fd Again, this is how qemu aborts: > (qemu) qemu: fatal: Trying to execute code outside RAM or ROM at > 0x0000000100000000 > > RAX=000000003e084da8 RBX=000000003e084868 RCX=0000000000000000 RDX=000000003e084f00 > RSI=0000000000000001 RDI=000000003e085000 RBP=000000003e084708 RSP=000000003fac8510 > R8 =0000000000000000 R9 =000000003e14c3e3 R10=0000000000000033 R11=00000000000000d3 > R12=000000003e0848a0 R13=0000000000000000 R14=0000000000000000 R15=0000000000000000 > RIP=00000000ffffffe4 RFL=00000046 [---Z-P-] CPL=0 II=0 A20=1 SMM=0 HLT=0 > ES =0008 0000000000000000 ffffffff 00cf9300 DPL=0 DS [-WA] > CS =0028 0000000000000000 ffffffff 00af9b00 DPL=0 CS64 [-RA] > SS =0008 0000000000000000 ffffffff 00cf9300 DPL=0 DS [-WA] > DS =0008 0000000000000000 ffffffff 00cf9300 DPL=0 DS [-WA] > FS =0008 0000000000000000 ffffffff 00cf9300 DPL=0 DS [-WA] > GS =0008 0000000000000000 ffffffff 00cf9300 DPL=0 DS [-WA] > LDT=0000 0000000000000000 0000ffff 00008200 DPL=0 LDT > TR =0000 0000000000000000 0000ffff 00008b00 DPL=0 TSS64-busy > GDT= 000000003fa50e98 0000003f > IDT= 000000003f9d6e20 00000fff > CR0=80000033 CR2=0000000000000000 CR3=000000003fa67000 CR4=00000668 > [...] Repeating from last time, we found it interesting that RIP=00000000ffffffe4 but the problem address is 0x0000000100000000. I made some lame attempts to find out what code is running there, and -- since I've read the term "nop slide" recently --, I'll call it just that: -------------------------------[ debug patch]--------------------------- ------------------------------[ proposed fix ]-------------------------- I'll post it separately to the list for review. Luiz, can you please test it with Windows guests? Thanks, Laszlo diff --git a/target-i386/translate.c b/target-i386/translate.c index 0aeccdb..0e0356f 100644 --- a/target-i386/translate.c +++ b/target-i386/translate.c @@ -7197,6 +7197,7 @@ static target_ulong disas_insn(CPUX86State *env, DisasContext *s, /* misc */ case 0x90: /* nop */ /* XXX: correct lock test for all insn */ + fprintf(stderr, "nop @ %016lx\n", pc_start); if (prefixes & PREFIX_LOCK) { goto illegal_op; } -------------------------------[ debug patch]--------------------------- The output it produces leading up to the abort quoted above is: nop @ 00000000ffffffe4 nop @ 00000000ffffffe5 nop @ 00000000ffffffe6 nop @ 00000000ffffffe7 nop @ 00000000fffffff5 nop @ 00000000fffffff6 nop @ 00000000fffffff7 nop @ 00000000fffffff8 nop @ 00000000fffffff9 nop @ 00000000fffffffa nop @ 00000000fffffffb nop @ 00000000fffffffc nop @ 00000000fffffffd nop @ 00000000fffffffe nop @ 00000000ffffffff qemu: fatal: Trying to execute code outside RAM or ROM at 0x0000000100000000 Hence "nop slide". Peeking into the coredump triggered by abort(), the backtrace is as follows: #0 0x00007fd53c02b8a5 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64 #1 0x00007fd53c02d085 in abort () at abort.c:92 #2 0x00007fd5428d0333 in cpu_abort (env=0x7fd544b89c10, fmt= 0x7fd542a47440 "Trying to execute code outside RAM or ROM at 0x%016lx\n") at /home/lacos/src/upstream/qemu/exec.c:542 #3 0x00007fd5428c9aa4 in get_page_addr_code (env1=0x7fd544b89c10, addr=4294967296) at /home/lacos/src/upstream/qemu/cputlb.c:338 #4 0x00007fd5429de266 in tb_gen_code (env=0x7fd544b89c10, pc=4294967268, cs_base=0, flags=4244148, cflags=0) at /home/lacos/src/upstream/qemu/translate-all.c:966 #5 0x00007fd5428c431b in tb_find_slow (env=0x7fd544b89c10, pc=4294967268, cs_base=0, flags=4244148) at /home/lacos/src/upstream/qemu/cpu-exec.c:139 #6 0x00007fd5428c44c4 in tb_find_fast (env=0x7fd544b89c10) at /home/lacos/src/upstream/qemu/cpu-exec.c:166 #7 0x00007fd5428c4c78 in cpu_x86_exec (env=0x7fd544b89c10) at /home/lacos/src/upstream/qemu/cpu-exec.c:593 #8 0x00007fd5428c8058 in tcg_cpu_exec (env=0x7fd544b89c10) at /home/lacos/src/upstream/qemu/cpus.c:1144 #9 0x00007fd5428c81a3 in tcg_exec_all () at /home/lacos/src/upstream/qemu/cpus.c:1177 #10 0x00007fd5428c7321 in qemu_tcg_cpu_thread_fn (arg=0x7fd544b89b00) at /home/lacos/src/upstream/qemu/cpus.c:844 #11 0x00007fd540ab4851 in start_thread (arg=0x7fd51a5e7700) at pthread_create.c:301 #12 0x00007fd53c0e190d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115 Frames #4 and #3 are interesting, I think they explain the PC values we're seeing: we're at 0xffffffe4 (frame #4, tb_gen_code(), pc=4294967268): http://git.qemu.org/?p=qemu.git;a=blob;f=translate-all.c;h=211be314cbeea7c076031a3c0abe8c0ada62152b;hb=HEAD#l933 The call to frame #3 occurs on line 966. This is what happens in this function (I think!) tb_gen_code() [translate-all.c] cpu_gen_code() [called from translate-all.c:958] gen_intermediate_code() [target-i386/translate.c] gen_intermediate_code_internal() disas_insn() get_page_addr_code() [called from translate-all.c:966] disas_insn() is the function that prints the NOP slide (one NOP at a time), and it is also the function that commit 4a6fd938, bisected by Duane, modifies. So, basically tb_gen_code() generates a block of code ("translation block" I suppose) with cpu_gen_code(), then checks if the generated block still fits on the same page (see the assignment to "virt_page2" and below, under the git link). According to gdb, the translation block generated for this sequence of NOPs starting at 0xffffffe4 has size 140. Consequently it hangs over into the next page (virt_page2 == 4GB, second arg for frame #3, get_page_addr_code()), and this triggers the abort. The translated code is as follows: (gdb) frame 4 #4 0x00007fd5429de266 in tb_gen_code (env=0x7fd544b89c10, pc=4294967268, cs_base=0, flags=4244148, cflags=0) at /home/lacos/src/upstream/qemu/translate-all.c:966 966 phys_page2 = get_page_addr_code(env, virt_page2); (gdb) print *tb $1 = {pc = 4294967268, cs_base = 0, flags = 4244148, size = 140, cflags = 0, tc_ptr = 0x7fd52a161e30 "A\213nÈ\205í\017\205è\031", phys_hash_next = 0x0, page_next = {0x0, 0x0}, page_addr = {0, 0}, tb_next_offset = {65535, 65535}, tb_jmp_offset = {0, 0}, jmp_next = {0x0, 0x0}, jmp_first = 0x0, icount = 76} (gdb) x/140xb tb->tc_ptr 0x7fd52a161e30 : 0x41 0x8b 0x6e 0xc8 0x85 0xed 0x0f 0x85 0x7fd52a161e38 : 0xe8 0x19 0x00 0x00 0x49 0x8b 0x2e 0x48 0x7fd52a161e40 : 0x8b 0xfd 0x48 0x8b 0xf5 0x48 0xc1 0xef 0x7fd52a161e48 : 0x07 0x48 0x81 0xe6 0x00 0xf0 0xff 0xff 0x7fd52a161e50 : 0x81 0xe7 0xe0 0x1f 0x00 0x00 0x49 0x8d 0x7fd52a161e58 : 0xbc 0x3e 0x28 0x05 0x00 0x00 0x48 0x3b 0x7fd52a161e60 : 0x37 0x48 0x8b 0xf5 0x0f 0x85 0xc9 0x19 0x7fd52a161e68 : 0x00 0x00 0x48 0x03 0x77 0x18 0x0f 0xb6 0x7fd52a161e70 : 0x1e 0x48 0x03 0xdd 0x48 0x8b 0xfd 0x48 0x7fd52a161e78 : 0x8b 0xf5 0x48 0xc1 0xef 0x07 0x48 0x81 0x7fd52a161e80 : 0xe6 0x00 0xf0 0xff 0xff 0x81 0xe7 0xe0 0x7fd52a161e88 : 0x1f 0x00 0x00 0x49 0x8d 0xbc 0x3e 0x30 0x7fd52a161e90 : 0x05 0x00 0x00 0x48 0x3b 0x37 0x48 0x8b 0x7fd52a161e98 : 0xf5 0x0f 0x85 0xad 0x19 0x00 0x00 0x48 0x7fd52a161ea0 : 0x03 0x77 0x10 0x88 0x1e 0x4c 0x8b 0xe5 0x7fd52a161ea8 : 0x4d 0x89 0xa6 0x98 0x00 0x00 0x00 0x49 0x7fd52a161eb0 : 0x89 0x9e 0x90 0x00 0x00 0x00 0x48 0x8b 0x7fd52a161eb8 : 0xfd 0x48 0x8b 0xf5 (gdb) x/140i tb->tc_ptr 0x7fd52a161e30 : mov -0x38(%r14),%ebp 0x7fd52a161e34 : test %ebp,%ebp 0x7fd52a161e36 : jne 0x7fd52a163824 0x7fd52a161e3c : mov (%r14),%rbp 0x7fd52a161e3f : mov %rbp,%rdi 0x7fd52a161e42 : mov %rbp,%rsi 0x7fd52a161e45 : shr $0x7,%rdi 0x7fd52a161e49 : and $0xfffffffffffff000,%rsi 0x7fd52a161e50 : and $0x1fe0,%edi 0x7fd52a161e56 : lea 0x528(%r14,%rdi,1),%rdi 0x7fd52a161e5e : cmp (%rdi),%rsi 0x7fd52a161e61 : mov %rbp,%rsi 0x7fd52a161e64 : jne 0x7fd52a163833 0x7fd52a161e6a : add 0x18(%rdi),%rsi 0x7fd52a161e6e : movzbl (%rsi),%ebx 0x7fd52a161e71 : add %rbp,%rbx 0x7fd52a161e74 : mov %rbp,%rdi 0x7fd52a161e77 : mov %rbp,%rsi 0x7fd52a161e7a : shr $0x7,%rdi 0x7fd52a161e7e : and $0xfffffffffffff000,%rsi 0x7fd52a161e85 : and $0x1fe0,%edi 0x7fd52a161e8b : lea 0x530(%r14,%rdi,1),%rdi 0x7fd52a161e93 : cmp (%rdi),%rsi 0x7fd52a161e96 : mov %rbp,%rsi 0x7fd52a161e99 : jne 0x7fd52a16384c 0x7fd52a161e9f : add 0x10(%rdi),%rsi 0x7fd52a161ea3 : mov %bl,(%rsi) 0x7fd52a161ea5 : mov %rbp,%r12 0x7fd52a161ea8 : mov %r12,0x98(%r14) 0x7fd52a161eaf : mov %rbx,0x90(%r14) 0x7fd52a161eb6 : mov %rbp,%rdi 0x7fd52a161eb9 : mov %rbp,%rsi No clue what's going on! Interestingly, when I start OVMF, qemu actually runs it for a pretty long time; OVMF produces a bunch of its usual debug output in "debug.log". The last debug message issued before the abort is InstallProtocolInterface: 7AB33A91-ACE5-4326-B572-E7EE33D39F16 3E9ED340 which corresponds to Managed Network Protocol being installed (probably on top of the direct NIC driver, a Simple Network Protocol implementation). The NIC driver (= SNP) is loaded from the EFI oprom "pc-bios/efi-e1000.rom" (e1000 is the default NIC model on the PC target). If I add "-net none" to the quoted command line, there's no nop slide at 0xffffffe4, nor a crash, but the expected OVMF setup UI doesn't appear either. If I add -netdev user,id=net0 \ -device e1000,netdev=net0,romfile=/dev/null (OVMF finds no driver for the e1000 NIC), it has the same effect as -net note -- no crash, no UI. Changing the NIC model to virtio-net-pci, and allowing qemu to load its oprom driver ("pc-bios/efi-virtio.rom"), that is, adding: -netdev user,id=net0 \ -device virtio-net-pci,netdev=net0 the VM runs in an endless reboot loop rather than crashing, and yes it reboots in the same place, after the Managed Network Protocol is installed. Lastly, the options -netdev user,id=net0 \ -device virtio-net-pci,netdev=net0,romfile=/dev/null disable the iPXE driver in "pc-bios/efi-virtio.rom", and allow my VirtioNetDxe driver (embedded in my OVMF build and pending on the edk2-devel list, but not yet merged) to drive the virtio NIC. In this case: - Managed Network Protocol is installed OK (according to debug.log), - no crash, no UI -- same as with "-net none" and with "e1000,romfile=/dev/null". So OVMF loading an iPXE driver from the NIC ROM bar interacts badly with commit 4a6fd938. ... I know zilch about instruction decoding, but commit 4a6fd938 seems to change the effect of PREFIX_ADR. Consider TARGET_X86_64 && CODE64(s) && (prefixes & PREFIX_ADR) - before the patch, "aflag" was not changed (from the original s->code32), - after the patch, the least significant bit of "aflag" is inverted. ... I can't believe it! The patch below fixes it for me! OVMF boots and I'm dropped to the EFI shell, after which I can also access the setup UI. ------------------------------[ proposed fix ]-------------------------- diff --git a/target-i386/translate.c b/target-i386/translate.c index 0e0356f..4fbd6c0 100644 --- a/target-i386/translate.c +++ b/target-i386/translate.c @@ -4813,7 +4813,11 @@ static target_ulong disas_insn(CPUX86State *env, DisasContext *s, /* 0x66 is ignored if rex.w is set */ dflag = 2; } - if (!(prefixes & PREFIX_ADR)) { + if (prefixes & PREFIX_ADR) { + /* flip it back, 0x67 should have no effect */ + aflag ^= 1; + } + else { aflag = 2; } }