diff mbox series

s390x TCG migration failure

Message ID 20230324184129.3119575-1-nsg@linux.ibm.com
State New
Headers show
Series s390x TCG migration failure | expand

Commit Message

Nina Schoetterl-Glausch March 24, 2023, 6:41 p.m. UTC
Hi,

We're seeing failures running s390x migration kvm-unit-tests tests with TCG.
Some initial findings:
What seems to be happening is that after migration a control block header accessed by the test code is all zeros which causes an unexpected exception.
I did a bisection which points to c8df4a7aef ("migration: Split save_live_pending() into state_pending_*") as the culprit.
The migration issue persists after applying the fix e264705012 ("migration: I messed state_pending_exact/estimate") on top of c8df4a7aef.

Applying


on top fixes or hides the issue. (The comparison was removed by c8df4a7aef.)
I arrived at this by experimentation, I haven't looked into why this makes a difference.

Any thoughts on the matter appreciated.

Comments

Thomas Huth March 28, 2023, 1:01 p.m. UTC | #1
On 24/03/2023 19.41, Nina Schoetterl-Glausch wrote:
> Hi,
> 
> We're seeing failures running s390x migration kvm-unit-tests tests with TCG.
> Some initial findings:
> What seems to be happening is that after migration a control block header accessed by the test code is all zeros which causes an unexpected exception.
> I did a bisection which points to c8df4a7aef ("migration: Split save_live_pending() into state_pending_*") as the culprit.
> The migration issue persists after applying the fix e264705012 ("migration: I messed state_pending_exact/estimate") on top of c8df4a7aef.

  Hi Nina,

could you please open a ticket in the QEMU bug tracker and add the "8.0" 
label there, so that it correctly shows up in the list of things that should 
be fixed before the 8.0 release?

> Applying
> 
> diff --git a/migration/ram.c b/migration/ram.c
> index 56ff9cd29d..2dc546cf28 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -3437,7 +3437,7 @@ static void ram_state_pending_exact(void *opaque, uint64_t max_size,
>   
>       uint64_t remaining_size = rs->migration_dirty_pages * TARGET_PAGE_SIZE;
>   
> -    if (!migration_in_postcopy()) {
> +    if (!migration_in_postcopy() && remaining_size < max_size) {
>           qemu_mutex_lock_iothread();
>           WITH_RCU_READ_LOCK_GUARD() {
>               migration_bitmap_sync_precopy(rs);
> 
> on top fixes or hides the issue. (The comparison was removed by c8df4a7aef.)
> I arrived at this by experimentation, I haven't looked into why this makes a difference.
> 
> Any thoughts on the matter appreciated.

Juan, could you comment on this, please?

  Thomas
Nina Schoetterl-Glausch March 28, 2023, 10:21 p.m. UTC | #2
On Tue, 2023-03-28 at 15:01 +0200, Thomas Huth wrote:
> On 24/03/2023 19.41, Nina Schoetterl-Glausch wrote:
> > Hi,
> > 
> > We're seeing failures running s390x migration kvm-unit-tests tests with TCG.
> > Some initial findings:
> > What seems to be happening is that after migration a control block header accessed by the test code is all zeros which causes an unexpected exception.
> > I did a bisection which points to c8df4a7aef ("migration: Split save_live_pending() into state_pending_*") as the culprit.
> > The migration issue persists after applying the fix e264705012 ("migration: I messed state_pending_exact/estimate") on top of c8df4a7aef.
> 
>   Hi Nina,
> 
> could you please open a ticket in the QEMU bug tracker and add the "8.0" 
> label there, so that it correctly shows up in the list of things that should 
> be fixed before the 8.0 release?

https://gitlab.com/qemu-project/qemu/-/issues/1565

Not sure if I cannot add a label or if I overlooked how to.
Thomas Huth March 29, 2023, 6:36 a.m. UTC | #3
On 29/03/2023 00.21, Nina Schoetterl-Glausch wrote:
> On Tue, 2023-03-28 at 15:01 +0200, Thomas Huth wrote:
>> On 24/03/2023 19.41, Nina Schoetterl-Glausch wrote:
>>> Hi,
>>>
>>> We're seeing failures running s390x migration kvm-unit-tests tests with TCG.
>>> Some initial findings:
>>> What seems to be happening is that after migration a control block header accessed by the test code is all zeros which causes an unexpected exception.
>>> I did a bisection which points to c8df4a7aef ("migration: Split save_live_pending() into state_pending_*") as the culprit.
>>> The migration issue persists after applying the fix e264705012 ("migration: I messed state_pending_exact/estimate") on top of c8df4a7aef.
>>
>>    Hi Nina,
>>
>> could you please open a ticket in the QEMU bug tracker and add the "8.0"
>> label there, so that it correctly shows up in the list of things that should
>> be fixed before the 8.0 release?
> 
> https://gitlab.com/qemu-project/qemu/-/issues/1565 

Thanks!

> Not sure if I cannot add a label or if I overlooked how to.

Ah, you might need to be at least a "Reviewer" in the qemu-project to be 
able to add labels and milestones. You can ask one of the owners or 
maintainers (see https://gitlab.com/qemu-project/qemu/-/project_members ) to 
add you as a reviewer.

Anyway, I added the 8.0 milestone now to the ticket.

  Thomas
Thomas Huth April 4, 2023, 3:18 p.m. UTC | #4
On 29/03/2023 08.36, Thomas Huth wrote:
> On 29/03/2023 00.21, Nina Schoetterl-Glausch wrote:
>> On Tue, 2023-03-28 at 15:01 +0200, Thomas Huth wrote:
>>> On 24/03/2023 19.41, Nina Schoetterl-Glausch wrote:
>>>> Hi,
>>>>
>>>> We're seeing failures running s390x migration kvm-unit-tests tests with 
>>>> TCG.
>>>> Some initial findings:
>>>> What seems to be happening is that after migration a control block 
>>>> header accessed by the test code is all zeros which causes an unexpected 
>>>> exception.
>>>> I did a bisection which points to c8df4a7aef ("migration: Split 
>>>> save_live_pending() into state_pending_*") as the culprit.
>>>> The migration issue persists after applying the fix e264705012 
>>>> ("migration: I messed state_pending_exact/estimate") on top of c8df4a7aef.
>>>
>>>    Hi Nina,
>>>
>>> could you please open a ticket in the QEMU bug tracker and add the "8.0"
>>> label there, so that it correctly shows up in the list of things that should
>>> be fixed before the 8.0 release?
>>
>> https://gitlab.com/qemu-project/qemu/-/issues/1565 
> 
> Thanks!

Ping!

Juan, did you have a chance to look at this issue yet? ... we're getting 
quite close to the 8.0 release, and IMHO this sounds like a critical bug?

  Thomas
Juan Quintela April 12, 2023, 8:31 p.m. UTC | #5
Nina Schoetterl-Glausch <nsg@linux.ibm.com> wrote:
> Hi,
>
> We're seeing failures running s390x migration kvm-unit-tests tests with TCG.
> Some initial findings:
> What seems to be happening is that after migration a control block header accessed by the test code is all zeros which causes an unexpected exception.
> I did a bisection which points to c8df4a7aef ("migration: Split save_live_pending() into state_pending_*") as the culprit.
> The migration issue persists after applying the fix e264705012 ("migration: I messed state_pending_exact/estimate") on top of c8df4a7aef.
>
> Applying
>
> diff --git a/migration/ram.c b/migration/ram.c
> index 56ff9cd29d..2dc546cf28 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -3437,7 +3437,7 @@ static void ram_state_pending_exact(void *opaque, uint64_t max_size,
>  
>      uint64_t remaining_size = rs->migration_dirty_pages * TARGET_PAGE_SIZE;
>  
> -    if (!migration_in_postcopy()) {
> +    if (!migration_in_postcopy() && remaining_size < max_size) {
>          qemu_mutex_lock_iothread();
>          WITH_RCU_READ_LOCK_GUARD() {
>              migration_bitmap_sync_precopy(rs);
>
> on top fixes or hides the issue. (The comparison was removed by c8df4a7aef.)
> I arrived at this by experimentation, I haven't looked into why this makes a difference.
>
> Any thoughts on the matter appreciated.

Ouch, you are right.
Good catch.

Queued the fix.

Later, Juan.
Juan Quintela April 12, 2023, 8:46 p.m. UTC | #6
Nina Schoetterl-Glausch <nsg@linux.ibm.com> wrote:
> Hi,
>
> We're seeing failures running s390x migration kvm-unit-tests tests with TCG.
> Some initial findings:
> What seems to be happening is that after migration a control block header accessed by the test code is all zeros which causes an unexpected exception.
> I did a bisection which points to c8df4a7aef ("migration: Split save_live_pending() into state_pending_*") as the culprit.
> The migration issue persists after applying the fix e264705012 ("migration: I messed state_pending_exact/estimate") on top of c8df4a7aef.
>
> Applying
>
> diff --git a/migration/ram.c b/migration/ram.c
> index 56ff9cd29d..2dc546cf28 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -3437,7 +3437,7 @@ static void ram_state_pending_exact(void *opaque, uint64_t max_size,
>  
>      uint64_t remaining_size = rs->migration_dirty_pages * TARGET_PAGE_SIZE;
>  
> -    if (!migration_in_postcopy()) {
> +    if (!migration_in_postcopy() && remaining_size < max_size) {
>          qemu_mutex_lock_iothread();
>          WITH_RCU_READ_LOCK_GUARD() {
>              migration_bitmap_sync_precopy(rs);
>
> on top fixes or hides the issue. (The comparison was removed by c8df4a7aef.)
> I arrived at this by experimentation, I haven't looked into why this makes a difference.

> Any thoughts on the matter appreciated.

This shouldn't be happening.  Famous last words.
I am still applying the patch, to get back to old behaviour, but we
shouldn't be needing this.

Basically when we call ram_state_pending_exact() we know that we want to
sync the bitmap.  But I guess that dirty block bitmap can be
"interesting" to say the less.

Later, Juan.
Juan Quintela April 12, 2023, 9:01 p.m. UTC | #7
Nina Schoetterl-Glausch <nsg@linux.ibm.com> wrote:
> Hi,
>
> We're seeing failures running s390x migration kvm-unit-tests tests with TCG.

As this is tcg, could you tell the exact command that you are running?
Does it needs to be in s390x host, rigth?

$ time ./tests/qtest/migration-test
# random seed: R02S940c4f22abc48b14868566639d3d6c77
# Skipping test: s390x host with KVM is required
1..0

real	0m0.003s
user	0m0.002s
sys	0m0.001s


> Some initial findings:
> What seems to be happening is that after migration a control block
> header accessed by the test code is all zeros which causes an
> unexpected exception.

What exception?

What do you mean here by control block header?

> I did a bisection which points to c8df4a7aef ("migration: Split save_live_pending() into state_pending_*") as the culprit.
> The migration issue persists after applying the fix e264705012 ("migration: I messed state_pending_exact/estimate") on top of c8df4a7aef.
>
> Applying
>
> diff --git a/migration/ram.c b/migration/ram.c
> index 56ff9cd29d..2dc546cf28 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -3437,7 +3437,7 @@ static void ram_state_pending_exact(void *opaque, uint64_t max_size,
>  
>      uint64_t remaining_size = rs->migration_dirty_pages * TARGET_PAGE_SIZE;
>  
> -    if (!migration_in_postcopy()) {
> +    if (!migration_in_postcopy() && remaining_size < max_size) {

If block is all zeros, then remaining_size should be zero, so always
smaller than max_size.

I don't really fully understand what is going here.

>          qemu_mutex_lock_iothread();
>          WITH_RCU_READ_LOCK_GUARD() {
>              migration_bitmap_sync_precopy(rs);
>
> on top fixes or hides the issue. (The comparison was removed by c8df4a7aef.)
> I arrived at this by experimentation, I haven't looked into why this makes a difference.
>
> Any thoughts on the matter appreciated.

Later, Juan.
Nina Schoetterl-Glausch April 13, 2023, 11:42 a.m. UTC | #8
On Wed, 2023-04-12 at 23:01 +0200, Juan Quintela wrote:
> Nina Schoetterl-Glausch <nsg@linux.ibm.com> wrote:
> > Hi,
> > 
> > We're seeing failures running s390x migration kvm-unit-tests tests with TCG.
> 
> As this is tcg, could you tell the exact command that you are running?
> Does it needs to be in s390x host, rigth?

I've just tried with a cross compile of kvm-unit-tests and that fails, too.

git clone https://gitlab.com/kvm-unit-tests/kvm-unit-tests.git
cd kvm-unit-tests/
./configure --cross-prefix=s390x-linux-gnu- --arch=s390x
make
for i in {0..30}; do echo $i; QEMU=../qemu/build/qemu-system-s390x ACCEL=tcg ./run_tests.sh migration-skey-sequential | grep FAIL && break; done

> 
> $ time ./tests/qtest/migration-test

I haven't looked if that test fails at all, we just noticed it with the kvm-unit-tests.

> # random seed: R02S940c4f22abc48b14868566639d3d6c77
> # Skipping test: s390x host with KVM is required
> 1..0
> 
> real	0m0.003s
> user	0m0.002s
> sys	0m0.001s
> 
> 
> > Some initial findings:
> > What seems to be happening is that after migration a control block
> > header accessed by the test code is all zeros which causes an
> > unexpected exception.
> 
> What exception?
> 
> What do you mean here by control block header?

It's all s390x test guest specific stuff, I don't expect it to be too helpful.
The guest gets a specification exception program interrupt while executing a SERVC because
the SCCB control block is invalid.

See https://gitlab.com/qemu-project/qemu/-/issues/1565 for a code snippet.
The guest sets a bunch of fields in the SCCB header, but when TCG emulates the SERVC,
they are zero which doesn't make sense.

> 
> > I did a bisection which points to c8df4a7aef ("migration: Split save_live_pending() into state_pending_*") as the culprit.
> > The migration issue persists after applying the fix e264705012 ("migration: I messed state_pending_exact/estimate") on top of c8df4a7aef.
> > 
> > Applying
> > 
> > diff --git a/migration/ram.c b/migration/ram.c
> > index 56ff9cd29d..2dc546cf28 100644
> > --- a/migration/ram.c
> > +++ b/migration/ram.c
> > @@ -3437,7 +3437,7 @@ static void ram_state_pending_exact(void *opaque, uint64_t max_size,
> >  
> >      uint64_t remaining_size = rs->migration_dirty_pages * TARGET_PAGE_SIZE;
> >  
> > -    if (!migration_in_postcopy()) {
> > +    if (!migration_in_postcopy() && remaining_size < max_size) {
> 
> If block is all zeros, then remaining_size should be zero, so always
> smaller than max_size.
> 
> I don't really fully understand what is going here.
> 
> >          qemu_mutex_lock_iothread();
> >          WITH_RCU_READ_LOCK_GUARD() {
> >              migration_bitmap_sync_precopy(rs);
> > 
> > on top fixes or hides the issue. (The comparison was removed by c8df4a7aef.)
> > I arrived at this by experimentation, I haven't looked into why this makes a difference.
> > 
> > Any thoughts on the matter appreciated.
> 
> Later, Juan.
>
diff mbox series

Patch

diff --git a/migration/ram.c b/migration/ram.c
index 56ff9cd29d..2dc546cf28 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -3437,7 +3437,7 @@  static void ram_state_pending_exact(void *opaque, uint64_t max_size,
 
     uint64_t remaining_size = rs->migration_dirty_pages * TARGET_PAGE_SIZE;
 
-    if (!migration_in_postcopy()) {
+    if (!migration_in_postcopy() && remaining_size < max_size) {
         qemu_mutex_lock_iothread();
         WITH_RCU_READ_LOCK_GUARD() {
             migration_bitmap_sync_precopy(rs);