KVM guests freeze under upstream kernel

Message ID 20170721011818.GC13187@pacoca (mailing list archive)
State Not Applicable

Commit Message

Jose Ricardo Ziviani July 21, 2017, 1:18 a.m. UTC
On Thu, Jul 20, 2017 at 03:21:59PM +1000, Paul Mackerras wrote:
> On Thu, Jul 20, 2017 at 12:02:23AM -0300, joserz@linux.vnet.ibm.com wrote:
> > On Thu, Jul 20, 2017 at 09:42:50AM +1000, Benjamin Herrenschmidt wrote:
> > > On Wed, 2017-07-19 at 16:46 -0300, joserz@linux.vnet.ibm.com wrote:
> > > > Hello!
> > > > 
> > > > We're not able to boot any KVM guest using upstream kernel (cb8c65ccff7f77d0285f1b126c72d37b2572c865 - 4.13.0-rc1+).
> > > > After reaching the SLOF initial counting, the guest simply freezes:
> > > 
> > > Can you send your .config?
> > 
> > Sure,
> > 
> > Answering Michael as well:
> > 
> > It's a P9 with RHEL kernel 4.11.0-10.el7a.ppc64le installed. The problem
> > was noticed with kernel > 4.13 (I'm currently running 4.13.0-rc1+).
> > 
> > QEMU is https://github.com/dgibson/qemu (ppc-for-2.10) but I gave the
> > default packaged Qemu a try.
> > 
> > For the guest, I tried both a vanilla Ubuntu 17.04 and the host kernel.
> > But they had never a chance to run since the freezing happened in SLOF.
> > 
> > Note that using the 4.11.0-10.el7a.ppc64le kernel it works fine
> > (for any of these Qemu/Guest setup). With 4.13.0-rc1 I have it run after
> > reverting that referred commit.
> 
> Is the host kernel running in radix mode?

yes

> 
> Did you check the host kernel logs for any oops messages?

dmesg was clean but after some time waiting (I forgot QEMU running in
another terminal) I got the oops below (after rebooting the host I 
couldn't reproduce it again).

Another test that I did was:
Compile with transparent huge pages disabled: KVM works fine
Compile with transparent huge pages enabled: doesn't work
  + disabling it in /sys/kernel/mm/transparent_hugepage: doesn't work
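
For anyone repeating the runtime test above: the sysfs file
/sys/kernel/mm/transparent_hugepage/enabled marks the active mode with
square brackets (for example "always madvise [never]"). The following is a
minimal sketch, not part of the original report, of how that line could be
parsed to confirm which mode is in effect. The function name
thp_active_mode is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy the bracketed (active) mode out of a line such as
 * "always madvise [never]" into out.
 * Returns 0 on success, -1 if no bracketed token is found or out is too small.
 * Hypothetical helper, for illustration only.
 */
int thp_active_mode(const char *line, char *out, size_t out_len)
{
    const char *open = strchr(line, '[');
    const char *close = open ? strchr(open, ']') : NULL;
    size_t len;

    if (!open || !close)
        return -1;

    len = (size_t)(close - open - 1);
    if (len + 1 > out_len)
        return -1;

    memcpy(out, open + 1, len);
    out[len] = '\0';
    return 0;
}

/* Small self-check showing the intended use. */
void thp_active_mode_demo(void)
{
    char mode[16];

    assert(thp_active_mode("always madvise [never]\n", mode, sizeof(mode)) == 0);
    assert(strcmp(mode, "never") == 0);
}
```

In practice the input line would be read from the sysfs file with fopen()
and fgets(); the sketch takes a string so it can be checked without a
running host.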

Just out of my own curiosity I made this small change:


and it works. I chose _RPAGE_RSV3 because it uses the same value that
x86 uses (0x0400000000000000UL) but I don't know if it could have any side
effect


SLOF
**********************************************************************
QEMU Starting
 Build Date = Mar  3 2017 13:29:19
  FW Version = git-66d250ef0fd06bb8
   Press "s" to enter Open Firmware.

   [  105.604333] Unable to handle kernel paging request for data at
   address 0x00000000
   [  105.604448] Faulting instruction address: 0xc000000000910b28
   [  105.604526] Oops: Kernel access of bad area, sig: 11 [#1]
   [  105.604585] SMP NR_CPUS=2048 
   [  105.604588] NUMA 
   [  105.604633] PowerNV
   [  105.604697] Modules linked in: xt_CHECKSUM ipt_MASQUERADE
   nf_nat_masquerade_ipv4 tun ip6t_rpfilter ipt_REJECT nf_reject_ipv4
   ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat
   ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6
   nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security
   ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4
   nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw
   ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter
   kvm_hv kvm i2c_dev at24 ghash_generic ses enclosure gf128mul
   scsi_transport_sas xts sg ctr ipmi_powernv ipmi_devintf shpchp
   opal_prd vmx_crypto ipmi_msghandler uio_pdrv_genirq uio ofpart
   powernv_flash i2c_opal ibmpowernv mtd nfsd auth_rpcgss nfs_acl lockd
   grace sunrpc ip_tables xfs libcrc32c
   [  105.605561]  sd_mod ast i2c_algo_bit drm_kms_helper syscopyarea
   sysfillrect sysimgblt fb_sys_fops ttm drm i40e i2c_core aacraid ptp
   pps_core dm_mirror dm_region_hash dm_log dm_mod
   [  105.605759] CPU: 0 PID: 6 Comm: kworker/u32:0 Not tainted
   4.13.0-rc1+ #57
   [  105.605836] Workqueue: netns cleanup_net
   [  105.605880] task: c000000ff6404200 task.stack: c000000ff648c000
   [  105.605947] NIP: c000000000910b28 LR: c0000000007cd6ec CTR:
   c0000000007cd5d0
   [  105.606026] REGS: c000000ff648f7d0 TRAP: 0300   Not tainted
   (4.13.0-rc1+)
   [  105.606090] MSR: 9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE>
   [  105.606111]   CR: 88002048  XER: 20000000
   [  105.606203] CFAR: c0000000007cd6e8 DAR: 0000000000000000 DSISR:
   40000000 SOFTE: 1 
   [  105.606203] GPR00: c0000000007cd6ec c000000ff648fa50
   c000000000f5c600 0000000000000000 
   [  105.606203] GPR04: c000000ff6404cc0 c000000ff6404280
   00000000782ccd5c 00000000cc908fe7 
   [  105.606203] GPR08: ffffffffffffffff c000000ff648c000
   0000000080000000 0000000000000000 
   [  105.606203] GPR12: c0000000007cd5d0 c00000000fb00000
   c0000000001050f8 c000000ffa150ec0 
   [  105.606203] GPR16: 0000000000000000 0000000000000000
   0000000000000000 c000000ffa1602a8 
   [  105.606203] GPR20: c000000ffa160078 c000000ff648fc20
   c000000000f03f68 c000000000f04080 
   [  105.606203] GPR24: 0000000001c9d4d8 0000000000000000
   0000000000000000 c000000ff951a280 
   [  105.606203] GPR28: c000000ffa202510 c000200e56e19bd0
   c000200e5bb48000 0000000000000000 
   [  105.606942] NIP [c000000000910b28] _raw_spin_lock_bh+0x38/0xd0
   [  105.607012] LR [c0000000007cd6ec] netlink_release+0x11c/0x5d0
   [  105.607078] Call Trace:
   [  105.607112] [c000000ff648fa50] [c000000ff648fb50]
   0xc000000ff648fb50 (unreliable)
   [  105.607196] [c000000ff648fa80] [c0000000007cd6ec]
   netlink_release+0x11c/0x5d0
   [  105.607278] [c000000ff648faf0] [c000000000752564]
   sock_release+0x44/0x100
   [  105.607353] [c000000ff648fb60] [c0000000007ca37c]
   netlink_kernel_release+0x2c/0x40
   [  105.607437] [c000000ff648fb80] [c00000000086eaa8]
   xfrm_user_net_exit+0x88/0xc0
   [  105.607519] [c000000ff648fbb0] [c00000000076d76c]
   ops_exit_list.isra.7+0x9c/0xc0
   [  105.607601] [c000000ff648fbf0] [c00000000076e450]
   cleanup_net+0x250/0x3d0
   [  105.607695] [c000000ff648fca0] [c0000000000fd240]
   process_one_work+0x180/0x460
   [  105.607778] [c000000ff648fd30] [c0000000000fd5a8]
   worker_thread+0x88/0x500
   [  105.607849] [c000000ff648fdc0] [c000000000105250]
   kthread+0x160/0x1a0
   [  105.607922] [c000000ff648fe30] [c00000000000b3a4]
   ret_from_kernel_thread+0x5c/0xb8
   [  105.608001] Instruction dump:
   [  105.608044] 7c0802a6 fbe1fff8 7c7f1b78 78290464 f8010010 f821ffd1
   8149000c 394a0200 
   [  105.608136] 9149000c 39400000 994d028c 814d0008 <7d201829>
   2c090000 40c20010 7d40192d 
   [  105.608234] ---[ end trace 58bb750815698d9b ]---
   [  107.018194] 
   [  109.018391] Kernel panic - not syncing: Fatal exception in
   interrupt
   [  110.234517] Rebooting in 10 seconds..
   [  120.253605] Trying to free IRQ 496 from IRQ context!
   [  120.253707] ------------[ cut here ]------------


> 
> Paul.
>

Comments

Jose Ricardo Ziviani July 26, 2017, 1:18 p.m. UTC | #1
On Thu, Jul 20, 2017 at 10:18:18PM -0300, joserz@linux.vnet.ibm.com wrote:
> On Thu, Jul 20, 2017 at 03:21:59PM +1000, Paul Mackerras wrote:
> > On Thu, Jul 20, 2017 at 12:02:23AM -0300, joserz@linux.vnet.ibm.com wrote:
> > > On Thu, Jul 20, 2017 at 09:42:50AM +1000, Benjamin Herrenschmidt wrote:
> > > > On Wed, 2017-07-19 at 16:46 -0300, joserz@linux.vnet.ibm.com wrote:
> > > > > Hello!
> > > > > 
> > > > > We're not able to boot any KVM guest using upstream kernel (cb8c65ccff7f77d0285f1b126c72d37b2572c865 - 4.13.0-rc1+).
> > > > > After reaching the SLOF initial counting, the guest simply freezes:
> > > > 
> > > > Can you send your .config?
> > > 
> > > Sure,
> > > 
> > > Answering Michael as well:
> > > 
> > > It's a P9 with RHEL kernel 4.11.0-10.el7a.ppc64le installed. The problem
> > > was noticed with kernel > 4.13 (I'm currently running 4.13.0-rc1+).
> > > 
> > > QEMU is https://github.com/dgibson/qemu (ppc-for-2.10) but I gave the
> > > default packaged Qemu a try.
> > > 
> > > For the guest, I tried both a vanilla Ubuntu 17.04 and the host kernel.
> > > But they had never a chance to run since the freezing happened in SLOF.
> > > 
> > > Note that using the 4.11.0-10.el7a.ppc64le kernel it works fine
> > > (for any of these Qemu/Guest setup). With 4.13.0-rc1 I have it run after
> > > reverting that referred commit.
> > 
> > Is the host kernel running in radix mode?
> 
> yes
> 
> > 
> > Did you check the host kernel logs for any oops messages?
> 
> dmesg was clean but after some time waiting (I forgot QEMU running in
> another terminal) I got the oops below (after rebooting the host I 
> couldn't reproduce it again).
> 
> Another test that I did was:
> Compile with transparent huge pages disabled: KVM works fine
> Compile with transparent huge pages enabled: doesn't work
>   + disabling it in /sys/kernel/mm/transparent_hugepage: doesn't work
> 
> Just out of my own curiosity I made this small change:
> 
> diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include
> index c0737c8..f94a3b6 100644
> --- a/arch/powerpc/include/asm/book3s/64/pgtable.h
> +++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
> @@ -80,7 +80,7 @@
>  
>  #define _PAGE_SOFT_DIRTY       _RPAGE_SW3 /* software: software dirty tracking */
>  #define _PAGE_SPECIAL          _RPAGE_SW2 /* software: special page */
> -#define _PAGE_DEVMAP           _RPAGE_SW1 /* software: ZONE_DEVICE page */
> +#define _PAGE_DEVMAP           _RPAGE_RSV3
>  #define __HAVE_ARCH_PTE_DEVMAP
> 
> and it works. I chose _RPAGE_RSV3 because it uses the same value that
> x86 uses (0x0400000000000000UL) but I don't know if it could have any side
> effect
> 

Does this change make any sense to you people?
I didn't see any side effects, except that device-backed memory will have
a bigger address space in transparent huge pages, IF I understand that
correctly.

If so I can send a patch with this change.

Thank you!!

Michael Ellerman July 27, 2017, 3:14 a.m. UTC | #2
joserz@linux.vnet.ibm.com writes:
> On Thu, Jul 20, 2017 at 10:18:18PM -0300, joserz@linux.vnet.ibm.com wrote:
>> On Thu, Jul 20, 2017 at 03:21:59PM +1000, Paul Mackerras wrote:
>> > 
>> > Did you check the host kernel logs for any oops messages?
>> 
>> dmesg was clean but after some time waiting (I forgot QEMU running in
>> another terminal) I got the oops below (after rebooting the host I 
>> couldn't reproduce it again).
>> 
>> Another test that I did was:
>> Compile with transparent huge pages disabled: KVM works fine
>> Compile with transparent huge pages enabled: doesn't work
>>   + disabling it in /sys/kernel/mm/transparent_hugepage: doesn't work
>> 
>> Just out of my own curiosity I made this small change:
>> 
>> diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include
>> index c0737c8..f94a3b6 100644
>> --- a/arch/powerpc/include/asm/book3s/64/pgtable.h
>> +++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
>> @@ -80,7 +80,7 @@
>>  
>>  #define _PAGE_SOFT_DIRTY       _RPAGE_SW3 /* software: software dirty tracking */
>>  #define _PAGE_SPECIAL          _RPAGE_SW2 /* software: special page */
>> -#define _PAGE_DEVMAP           _RPAGE_SW1 /* software: ZONE_DEVICE page */
>> +#define _PAGE_DEVMAP           _RPAGE_RSV3
>>  #define __HAVE_ARCH_PTE_DEVMAP
>> 
>> and it works. I chose _RPAGE_RSV3 because it uses the same value that
>> x86 uses (0x0400000000000000UL) but I don't know if it could have any side
>> effect
>> 
>
> Does this change make any sense to you people?

No :)

I think it's just hiding the bug somehow. Presumably we have some code
somewhere that is getting confused by _RPAGE_SW1 being set, or setting
that bit incorrectly.
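
To illustrate how moving the flag could mask such a bug rather than fix it,
here is a minimal user-space sketch. It is an illustration under stated
assumptions, not kernel code: the _RPAGE_RSV3 value is the one quoted in
this thread, while 0x800 for _RPAGE_SW1 is an assumed value that the thread
does not state. All names prefixed with sketch_ or SKETCH_ are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Assumed bit values, for illustration only.
 * RSV3 is the value quoted in the thread; SW1 = 0x800 is an assumption.
 */
#define SKETCH_RPAGE_SW1  UINT64_C(0x0000000000000800)
#define SKETCH_RPAGE_RSV3 UINT64_C(0x0400000000000000)

/* True if the entry has the bit currently chosen as the "devmap" flag. */
bool sketch_is_devmap(uint64_t entry, uint64_t devmap_mask)
{
    return (entry & devmap_mask) != 0;
}

/*
 * Hypothetical entry in which some unrelated code has set the SW1 bit.
 * With _PAGE_DEVMAP == SW1 it is (wrongly) seen as a devmap entry;
 * with _PAGE_DEVMAP == RSV3 the stray bit is simply no longer looked at.
 */
void sketch_relocation_demo(void)
{
    uint64_t stray = SKETCH_RPAGE_SW1;

    assert(sketch_is_devmap(stray, SKETCH_RPAGE_SW1));
    assert(!sketch_is_devmap(stray, SKETCH_RPAGE_RSV3));
}
```

The stray bit is still set in the second case; only the check has moved,
which is why the change can make the symptom disappear without addressing
whatever sets or misreads the bit.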

cheers

Suraj Jitindar Singh July 27, 2017, 6:56 a.m. UTC | #3
On Thu, 2017-07-27 at 13:14 +1000, Michael Ellerman wrote:
> joserz@linux.vnet.ibm.com writes:
> > On Thu, Jul 20, 2017 at 10:18:18PM -0300, joserz@linux.vnet.ibm.com
> >  wrote:
> > > On Thu, Jul 20, 2017 at 03:21:59PM +1000, Paul Mackerras wrote:
> > > > 
> > > > Did you check the host kernel logs for any oops messages?
> > > 
> > > > dmesg was clean but after some time waiting (I forgot QEMU running
> > > in
> > > another terminal) I got the oops below (after rebooting the host
> > > I 
> > > couldn't reproduce it again).
> > > 
> > > Another test that I did was:
> > > Compile with transparent huge pages disabled: KVM works fine
> > > Compile with transparent huge pages enabled: doesn't work
> > >   + disabling it in /sys/kernel/mm/transparent_hugepage: doesn't
> > > work
> > > 
> > > Just out of my own curiosity I made this small change:
> > > 
> > > > diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include
> > > > index c0737c8..f94a3b6 100644
> > > > --- a/arch/powerpc/include/asm/book3s/64/pgtable.h
> > > > +++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
> > > > @@ -80,7 +80,7 @@
> > > >  
> > > >  #define _PAGE_SOFT_DIRTY       _RPAGE_SW3 /* software: software dirty tracking */
> > > >  #define _PAGE_SPECIAL          _RPAGE_SW2 /* software: special page */
> > > > -#define _PAGE_DEVMAP           _RPAGE_SW1 /* software: ZONE_DEVICE page */
> > > > +#define _PAGE_DEVMAP           _RPAGE_RSV3
> > > >  #define __HAVE_ARCH_PTE_DEVMAP
> > > 
> > > and it works. I chose _RPAGE_RSV3 because it uses the same value
> > > that
> > > > x86 uses (0x0400000000000000UL) but I don't know if it could have any
> > > side
> > > effect
> > > 
> > 
> > Does this change make any sense to you people?
> 
> No :)
> 
> I think it's just hiding the bug somehow. Presumably we have some
> code
> somewhere that is getting confused by _RPAGE_SW1 being set, or
> setting
> that bit incorrectly.

kernel BUG at /scratch/surajjs/linux/arch/powerpc/include/asm/book3s/64/radix.h:260!
Oops: Exception in kernel mode, sig: 5 [#1]
SMP NR_CPUS=2048 
NUMA 
PowerNV
Modules linked in:
CPU: 3 PID: 2050 Comm: qemu-system-ppc Not tainted 4.13.0-rc2-00001-g2f3013c-dirty #1
task: c000000f1ebc0000 task.stack: c000000f1ec00000
NIP: c000000000070fd4 LR: c0000000000e2120 CTR: c0000000000e20d0
REGS: c000000f1ec036b0 TRAP: 0700   Not tainted  (4.13.0-rc2-00001-g2f3013c-dirty)
MSR: 900000000282b033 <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>
  CR: 22244824  XER: 00000000
CFAR: c000000000070e74 SOFTE: 1 
GPR00: 0000000000000009 c000000f1ec03930 c000000001067400 0000000019cf0a05 
GPR04: c000000000000000 050acf190f000080 0000000000000005 0000000000000800 
GPR08: 0000000000000015 8000000f19cf0a05 c000000f1eb64368 0000000000000009 
GPR12: 0000000000000009 c00000000fd80f00 c000000f1eca7a30 4000000000000000 
GPR16: 5f9fffffffff1780 4000000000002000 00007fff5fff0000 00007fff879700a6 
GPR20: 8000000000000108 c00000000110bce0 0000000000000f61 c0000000000e20d0 
GPR24: 000000000000ffff c000000f1c7a6008 00007fff6f600000 00007fff5fff0000 
GPR28: c000000f19fd0000 000000000da00000 0000000000000000 c000000f1ec03990 
NIP [c000000000070fd4] __find_linux_pte_or_hugepte+0x1d4/0x350
LR [c0000000000e2120] kvm_unmap_radix+0x50/0x1d0
Call Trace:
[c000000f1ec03930] [c0000000000b2554] mark_page_dirty+0x34/0xa0 (unreliable)
[c000000f1ec03970] [c0000000000e2120] kvm_unmap_radix+0x50/0x1d0
[c000000f1ec039c0] [c0000000000dbea0] kvm_handle_hva_range+0x100/0x170
[c000000f1ec03a30] [c0000000000df43c] kvm_unmap_hva_range_hv+0x6c/0x80
[c000000f1ec03a70] [c0000000000c7588] kvm_unmap_hva_range+0x48/0x60
[c000000f1ec03ab0] [c0000000000bb77c] kvm_mmu_notifier_invalidate_range_start+0x8c/0x130
[c000000f1ec03b10] [c000000000316f10] __mmu_notifier_invalidate_range_start+0xa0/0xf0
[c000000f1ec03b60] [c0000000002e95f0] change_protection+0x840/0xe20
[c000000f1ec03cb0] [c000000000313050] change_prot_numa+0x50/0xd0
[c000000f1ec03d00] [c000000000143f24] task_numa_work+0x2b4/0x3b0
[c000000f1ec03dc0] [c000000000128738] task_work_run+0xf8/0x160
[c000000f1ec03e00] [c00000000001db94] do_notify_resume+0xe4/0xf0
[c000000f1ec03e30] [c00000000000b744] ret_from_except_lite+0x70/0x74
Instruction dump:
419e00ec 60000000 78a70022 54a9403e 50a9c00e 54e3403e 50a9c42e 50e3c00e 
50e3c42e 792907c6 7d291b78 55270528 <0b070000> 3ce04000 3c804000 78e707c6 
---[ end trace aecf406c356566bb ]---


The BUG_ON I added was:

arch/powerpc/include/asm/book3s/64/radix.h:260:
258 static inline int radix__pmd_trans_huge(pmd_t pmd)
259 {
260         BUG_ON(pmd_val(pmd) & _PAGE_DEVMAP);
261         return (pmd_val(pmd) & (_PAGE_PTE | _PAGE_DEVMAP)) == _PAGE_PTE;
262 }
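
As a rough user-space model of the check above (a sketch under stated
assumptions, not the kernel implementation), the predicate and the added
BUG_ON condition can be exercised on plain 64-bit values. The bit values
used for _PAGE_PTE (0x4000000000000000) and _PAGE_DEVMAP (0x800, i.e. the
unpatched _RPAGE_SW1) are assumptions not stated in this thread, and the
model_ names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit values, for illustration only (not taken from the thread). */
#define MODEL_PAGE_PTE    UINT64_C(0x4000000000000000) /* leaf entry marker */
#define MODEL_PAGE_DEVMAP UINT64_C(0x0000000000000800) /* unpatched _RPAGE_SW1 */

/* Same expression as the return statement on line 261 above. */
bool model_pmd_trans_huge(uint64_t pmd)
{
    return (pmd & (MODEL_PAGE_PTE | MODEL_PAGE_DEVMAP)) == MODEL_PAGE_PTE;
}

/* Condition under which the BUG_ON added on line 260 would fire. */
bool model_bug_on_fires(uint64_t pmd)
{
    return (pmd & MODEL_PAGE_DEVMAP) != 0;
}

void model_pmd_demo(void)
{
    /* Leaf entry without the devmap bit: treated as a transparent huge page. */
    assert(model_pmd_trans_huge(MODEL_PAGE_PTE));
    assert(!model_bug_on_fires(MODEL_PAGE_PTE));

    /* Leaf entry with the devmap bit: not trans-huge, and the BUG_ON fires. */
    assert(!model_pmd_trans_huge(MODEL_PAGE_PTE | MODEL_PAGE_DEVMAP));
    assert(model_bug_on_fires(MODEL_PAGE_PTE | MODEL_PAGE_DEVMAP));
}
```

In this model, hitting the BUG_ON means an entry reached the helper with
the devmap bit set; without the BUG_ON the same entry would just be
reported as "not a transparent huge page".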

> 
> cheers

Patch

diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include
index c0737c8..f94a3b6 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -80,7 +80,7 @@
 
 #define _PAGE_SOFT_DIRTY       _RPAGE_SW3 /* software: software dirty tracking */
 #define _PAGE_SPECIAL          _RPAGE_SW2 /* software: special page */
-#define _PAGE_DEVMAP           _RPAGE_SW1 /* software: ZONE_DEVICE page */
+#define _PAGE_DEVMAP           _RPAGE_RSV3
 #define __HAVE_ARCH_PTE_DEVMAP