[v4,0/6] spapr/xics: fix migration of older machine types

Message ID 87fuf0w6dp.fsf@abhimanyu.i-did-not-set--mail-host-address--so-tickle-me
State New
Commit Message

Nikunj A Dadhania June 16, 2017, 10:53 a.m. UTC
Nikunj A Dadhania <nikunj@linux.vnet.ibm.com> writes:

> Greg Kurz <groug@kaod.org> writes:
>
>> On Sun, 11 Jun 2017 17:38:42 +0800
>> David Gibson <david@gibson.dropbear.id.au> wrote:
>>
>>> On Fri, Jun 09, 2017 at 05:09:13PM +0200, Greg Kurz wrote:
>>> > On Fri, 9 Jun 2017 20:28:32 +1000
>>> > David Gibson <david@gibson.dropbear.id.au> wrote:
>>> >   
>>> > > On Fri, Jun 09, 2017 at 11:36:31AM +0200, Greg Kurz wrote:  
>>> > > > On Fri, 9 Jun 2017 12:28:13 +1000
>>> > > > David Gibson <david@gibson.dropbear.id.au> wrote:
>>> > > >     
>>> > 1) start guest
>>> > 
>>> > qemu-system-ppc64 \
>>> >  -nodefaults -nographic -snapshot -no-shutdown -serial mon:stdio \
>>> >  -device virtio-net,netdev=netdev0,id=net0 \
>>> >  -netdev bridge,id=netdev0,br=virbr0,helper=/usr/libexec/qemu-bridge-helper \
>>> >  -device virtio-blk,drive=drive0,id=blk0 \
>>> >  -drive file=/home/greg/images/sle12-sp1-ppc64le.qcow2,id=drive0,if=none \
>>> >  -machine type=pseries,accel=tcg -cpu POWER8
>
> Strangely, your command line does not have multiple threads. Need to see
> what is the side effect of enabling MTTCG by default here.
>
>>> > 
>>> > 2) migrate
>>> > 
>>> > 3) destination crashes (immediately or after very short delay) or
>>> > hangs  
>>> 
>>> Ok.  I'll bisect it when I can, but you might well get to it first.
>>> 
>>> 
>>
>> Heh, maybe you didn't see in my mail but I did bisect:
>>
>> f0b0685d6694a28c66018f438e822596243b1250 is the first bad commit
>> commit f0b0685d6694a28c66018f438e822596243b1250
>> Author: Nikunj A Dadhania <nikunj@linux.vnet.ibm.com>
>> Date:   Thu Apr 27 10:48:23 2017 +0530
>>
>>     tcg: enable MTTCG by default for PPC64 on x86
>
> Let me have a look at it.

Interesting problem here: once the migration has completed on the source,
the guest crashes on the destination:

[   56.185314] Unable to handle kernel paging request for data at address 0x5deadbeef0000108
[   56.185401] Faulting instruction address: 0xc000000000277bc8

   0xc000000000277bb8 <+168>:	ld      r7,8(r4)
   0xc000000000277bbc <+172>:	ld      r6,0(r4)                  <========
   0xc000000000277bc0 <+176>:	ori     r8,r8,56302
   0xc000000000277bc4 <+180>:	rldicr  r8,r8,32,31
   0xc000000000277bc8 <+184>:	std     r7,8(r6)

r4 = 0xf0000000000107a0
r6 = 0x5deadbeef0000100

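The load at 0xc000000000277bbc <+172> put a junk value in r6, which leads
to the guest crash. When I inspect the memory on source and destination in
the qemu monitor, I get the following differences:

diff -u s.txt d.txt 
--- s.txt	2017-06-16 10:34:39.657221125 +0530
+++ d.txt	2017-06-16 10:34:18.452238305 +0530
@@ -8,8 +8,8 @@
 f000000000010760: 0x20de0b00 0x000000f0 0x60040100 0x000000f0
 f000000000010770: 0x00000000 0x00000000 0x0004036d 0x000000c0
 f000000000010780: 0x6c000100 0xf8ff3f00 0x7817f977 0x000000c0
-f000000000010790: 0x15000000 0x00000000 0xffffffff 0x01000000
-f0000000000107a0: 0x3090a96d 0x000000c0 0x3090a96d 0x000000c0
+f000000000010790: 0x01000000 0x00000000 0xffffffff 0x01000000
+f0000000000107a0: 0x000100f0 0xeedbea5d 0x000200f0 0xeedbea5d
 f0000000000107b0: 0x00000000 0x00000000 0x00d0a96d 0x000000c0
 f0000000000107c0: 0x28000000 0xf8ff3f00 0x8852cc77 0x000000c0
 f0000000000107d0: 0x00000000 0x00000000 0xffffffff 0x01000000

The source had a valid address at 0xf0000000000107a0, while the destination
had garbage there.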
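
As a cross-check, here is a minimal Python sketch that decodes the monitor
words at 0xf0000000000107a0 into the 64-bit values a little-endian guest
would load. It assumes the monitor printed each 32-bit word with its bytes
in big-endian order; that assumption is not confirmed in this thread, but
it is consistent with the r6 value above. The helper name decode_le64 is
made up for illustration.

```python
import struct

def decode_le64(first_word: int, second_word: int) -> int:
    """Rebuild the 64-bit little-endian value behind two dump words.

    Assumption: each 32-bit word in the dump shows its four bytes in
    big-endian order, so packing with ">II" recovers memory byte order.
    """
    raw = struct.pack(">II", first_word, second_word)  # bytes as laid out in memory
    return struct.unpack("<Q", raw)[0]                 # what a LE 'ld' would load

# Destination words at f0000000000107a0 and f0000000000107a8
dest_first = decode_le64(0x000100F0, 0xEEDBEA5D)
dest_second = decode_le64(0x000200F0, 0xEEDBEA5D)

# Source words at f0000000000107a0
src_first = decode_le64(0x3090A96D, 0x000000C0)

# 'std r7,8(r6)' stores to r6 + 8, which is the address the kernel reports
fault_addr = dest_first + 8

print(hex(dest_first), hex(dest_second), hex(src_first), hex(fault_addr))
```

Under that assumption, the first destination value equals r6
(0x5deadbeef0000100) and r6 + 8 equals the faulting data address
(0x5deadbeef0000108). The two destination values differ only in their low
bits (0x100 and 0x200), which resembles the kernel's list poison pattern,
though that reading is a guess and not something established here.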

Some observations:

* Source updates the memory location (probably atomic_cmpxchg), but the
  updated page didn't get transferred to the destination.

* Getting rid of the atomic_cmpxchg tcg ops in ldarx/stdcx makes migration
  work fine. MTTCG is running with 1 cpu.

While I continue debugging, any hints would help.

Regards
Nikunj

Comments

David Gibson June 16, 2017, 2:28 p.m. UTC | #1
On Fri, Jun 16, 2017 at 04:23:38PM +0530, Nikunj A Dadhania wrote:
> Nikunj A Dadhania <nikunj@linux.vnet.ibm.com> writes:
> 
> > Greg Kurz <groug@kaod.org> writes:
> >
> >> On Sun, 11 Jun 2017 17:38:42 +0800
> >> David Gibson <david@gibson.dropbear.id.au> wrote:
> >>
> >>> On Fri, Jun 09, 2017 at 05:09:13PM +0200, Greg Kurz wrote:
> >>> > On Fri, 9 Jun 2017 20:28:32 +1000
> >>> > David Gibson <david@gibson.dropbear.id.au> wrote:
> >>> >   
> >>> > > On Fri, Jun 09, 2017 at 11:36:31AM +0200, Greg Kurz wrote:  
> >>> > > > On Fri, 9 Jun 2017 12:28:13 +1000
> >>> > > > David Gibson <david@gibson.dropbear.id.au> wrote:
> >>> > > >     
> >>> > 1) start guest
> >>> > 
> >>> > qemu-system-ppc64 \
> >>> >  -nodefaults -nographic -snapshot -no-shutdown -serial mon:stdio \
> >>> >  -device virtio-net,netdev=netdev0,id=net0 \
> >>> >  -netdev bridge,id=netdev0,br=virbr0,helper=/usr/libexec/qemu-bridge-helper \
> >>> >  -device virtio-blk,drive=drive0,id=blk0 \
> >>> >  -drive file=/home/greg/images/sle12-sp1-ppc64le.qcow2,id=drive0,if=none \
> >>> >  -machine type=pseries,accel=tcg -cpu POWER8
> >
> > Strangely, your command line does not have multiple threads. Need to see
> > what is the side effect of enabling MTTCG by default here.
> >
> >>> > 
> >>> > 2) migrate
> >>> > 
> >>> > 3) destination crashes (immediately or after very short delay) or
> >>> > hangs  
> >>> 
> >>> Ok.  I'll bisect it when I can, but you might well get to it first.
> >>> 
> >>> 
> >>
> >> Heh, maybe you didn't see in my mail but I did bisect:
> >>
> >> f0b0685d6694a28c66018f438e822596243b1250 is the first bad commit
> >> commit f0b0685d6694a28c66018f438e822596243b1250
> >> Author: Nikunj A Dadhania <nikunj@linux.vnet.ibm.com>
> >> Date:   Thu Apr 27 10:48:23 2017 +0530
> >>
> >>     tcg: enable MTTCG by default for PPC64 on x86
> >
> > Let me have a look at it.
> 
> Interesting problem here, I see that when the migration is completed on
> source and there is a crash on destination:
> 
> [   56.185314] Unable to handle kernel paging request for data at address 0x5deadbeef0000108
> [   56.185401] Faulting instruction address: 0xc000000000277bc8
> 
>    0xc000000000277bb8 <+168>:	ld      r7,8(r4)
>    0xc000000000277bbc <+172>:	ld      r6,0(r4)                  <========
>    0xc000000000277bc0 <+176>:	ori     r8,r8,56302
>    0xc000000000277bc4 <+180>:	rldicr  r8,r8,32,31
>    0xc000000000277bc8 <+184>:	std     r7,8(r6)
> 
> r4 = 0xf0000000000107a0
> r6 = 0x5deadbeef0000100
> 
> Code at 0xc000000000277bbc <+172>, gave junk value in r6, that leads to
> the guest crash. When I inspect the memory on source and destination in
> qemu monitor, I get the following differences:
> 
> diff -u s.txt d.txt 
> --- s.txt	2017-06-16 10:34:39.657221125 +0530
> +++ d.txt	2017-06-16 10:34:18.452238305 +0530
> @@ -8,8 +8,8 @@
>  f000000000010760: 0x20de0b00 0x000000f0 0x60040100 0x000000f0
>  f000000000010770: 0x00000000 0x00000000 0x0004036d 0x000000c0
>  f000000000010780: 0x6c000100 0xf8ff3f00 0x7817f977 0x000000c0
> -f000000000010790: 0x15000000 0x00000000 0xffffffff 0x01000000
> -f0000000000107a0: 0x3090a96d 0x000000c0 0x3090a96d 0x000000c0
> +f000000000010790: 0x01000000 0x00000000 0xffffffff 0x01000000
> +f0000000000107a0: 0x000100f0 0xeedbea5d 0x000200f0 0xeedbea5d
>  f0000000000107b0: 0x00000000 0x00000000 0x00d0a96d 0x000000c0
>  f0000000000107c0: 0x28000000 0xf8ff3f00 0x8852cc77 0x000000c0
>  f0000000000107d0: 0x00000000 0x00000000 0xffffffff 0x01000000
> 
> Source had a valid address at 0xf0000000000107a0, while garbage on the
> destination.
> 
> Some observations:
> 
> * Source updates the memory location (probably atomic_cmpxchg), but the
>   updated page didn't get transferred to the destination
>   
> * Getting rid of atomic_cmpxchg tcg ops in ldarx/stdcx, makes migration
>   work fine. MTTCG running with 1 cpu.
> 
> While I continue debugging, any hints would help.

My first guess would be that some or all of the new TCG atomic
primitives aren't updating the dirty page bitmap.

My second guess would be a race between the atomic TCG ops and the
migration / dirty map handling, which means we can lose a memory update
and not transfer it to the destination.
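
To make the first guess concrete, here is a toy Python model of dirty-page
tracking during migration. It is only a sketch under stated assumptions:
ToyRam, store, cmpxchg_no_dirty and sync_dirty_pages are hypothetical names
and do not correspond to QEMU's actual code or APIs. It shows how a write
path that skips dirty marking leaves the destination with stale data. The
second guess (a race) is not modelled.

```python
PAGE_SIZE = 4096

class ToyRam:
    """Hypothetical guest RAM with a per-page dirty set (not QEMU's API)."""

    def __init__(self, size: int) -> None:
        self.mem = bytearray(size)
        self.dirty = set()

    def store(self, addr: int, data: bytes) -> None:
        """Normal store path: write the bytes and mark the page dirty."""
        self.mem[addr:addr + len(data)] = data
        self.dirty.add(addr // PAGE_SIZE)

    def cmpxchg_no_dirty(self, addr: int, expected: bytes, new: bytes) -> bool:
        """Buggy atomic path: updates memory but never marks the page dirty."""
        if bytes(self.mem[addr:addr + len(expected)]) == expected:
            self.mem[addr:addr + len(new)] = new
            return True
        return False

def sync_dirty_pages(src: ToyRam, dst: bytearray) -> None:
    """One migration pass: copy every dirty page, then clear the dirty set."""
    for page in sorted(src.dirty):
        start = page * PAGE_SIZE
        dst[start:start + PAGE_SIZE] = src.mem[start:start + PAGE_SIZE]
    src.dirty.clear()

src = ToyRam(2 * PAGE_SIZE)
dst = bytearray(2 * PAGE_SIZE)

src.store(0x7A0, b"\x00\x01")        # tracked write, page 0 becomes dirty
sync_dirty_pages(src, dst)           # first pass copies page 0

swapped = src.cmpxchg_no_dirty(0x7A0, b"\x00\x01", b"\x30\x90")
sync_dirty_pages(src, dst)           # nothing is dirty, so nothing is copied

stale = bytes(dst[0x7A0:0x7A2])      # destination still holds the old bytes
fresh = bytes(src.mem[0x7A0:0x7A2])  # source holds the updated bytes
print(swapped, stale, fresh)
```

In this model the final pass sends nothing because the atomic update never
set a dirty bit, so source and destination disagree at the updated
location, which is the same kind of mismatch the memory dump shows.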