diff mbox series

[2/2] migration: Fix return-path thread exit

Message ID 20240201184853.890471-3-clg@redhat.com
State New
Series migration: Fix return-path thread exit

Commit Message

Cédric Le Goater Feb. 1, 2024, 6:48 p.m. UTC
In case of error, close_return_path_on_source() can perform a shutdown
to exit the return-path thread.  However, in migrate_fd_cleanup(),
'to_dst_file' is closed before calling close_return_path_on_source()
and the shutdown fails, leaving the source and destination waiting for
an event to occur.

Close the file after calling close_return_path_on_source() so that the
shutdown succeeds and the return-path thread exits.

Signed-off-by: Cédric Le Goater <clg@redhat.com>
---
 migration/migration.c | 12 +++++-------
 1 file changed, 5 insertions(+), 7 deletions(-)
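
The ordering problem described in the commit message can be illustrated with a toy model. This is only a sketch, not QEMU code: `ToyChannel`, `ToyFile` and every `toy_*` function are hypothetical stand-ins. The assumption that both files wrap one shared channel comes from the "two qemufiles share the same ioc" remark later in the thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy channel shared by both directions; NOT the real QIOChannel. */
typedef struct ToyChannel {
    bool open;
    bool shut_down;
} ToyChannel;

/* Toy file wrapper; NOT the real QEMUFile. */
typedef struct ToyFile {
    ToyChannel *ioc;
} ToyFile;

/* Assumption of this model: closing either wrapper closes the shared channel. */
static void toy_fclose(ToyFile *f)
{
    assert(f && f->ioc);
    f->ioc->open = false;
}

/* A shutdown can only succeed on a channel that is still open. */
static int toy_shutdown(ToyFile *f)
{
    assert(f && f->ioc);
    if (!f->ioc->open) {
        return -1;
    }
    f->ioc->shut_down = true;
    return 0;
}

/* Order before the patch: close the outgoing file, then attempt the shutdown. */
static int toy_cleanup_close_first(ToyFile *to_dst, ToyFile *from_dst)
{
    toy_fclose(to_dst);
    return toy_shutdown(from_dst);
}

/* Order after the patch: attempt the shutdown, then close the outgoing file. */
static int toy_cleanup_shutdown_first(ToyFile *to_dst, ToyFile *from_dst)
{
    int ret = toy_shutdown(from_dst);
    toy_fclose(to_dst);
    return ret;
}
```

In this model, `toy_cleanup_close_first()` returns -1 because the channel is already closed when the shutdown is attempted, while `toy_cleanup_shutdown_first()` returns 0 and leaves the channel marked as shut down.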

Comments

Fabiano Rosas Feb. 2, 2024, 2:42 p.m. UTC | #1
Cédric Le Goater <clg@redhat.com> writes:

> In case of error, close_return_path_on_source() can perform a shutdown
> to exit the return-path thread.  However, in migrate_fd_cleanup(),
> 'to_dst_file' is closed before calling close_return_path_on_source()
> and the shutdown fails, leaving the source and destination waiting for
> an event to occur.

At close_return_path_on_source, qemu_file_shutdown() and checking
ms->to_dst_file are done under the qemu_file_lock, so how could
migrate_fd_cleanup() have cleared the pointer but the ms->to_dst_file
check have passed?

>
> Close the file after calling close_return_path_on_source() so that the
> shutdown succeeds and the return-path thread exits.
>
> Signed-off-by: Cédric Le Goater <clg@redhat.com>
> ---
>  migration/migration.c | 12 +++++-------
>  1 file changed, 5 insertions(+), 7 deletions(-)
>
> diff --git a/migration/migration.c b/migration/migration.c
> index 2c3362235c7651c11d581f3c3639571f1f9636ef..1e0b6acaedc272e8ce26ad40be2c42177f5fd14e 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -1314,6 +1314,7 @@ void migrate_set_state(int *state, int old_state, int new_state)
>  static void migrate_fd_cleanup(MigrationState *s)
>  {
>      int file_error = 0;
> +    QEMUFile *tmp = NULL;
>  
>      g_free(s->hostname);
>      s->hostname = NULL;
> @@ -1323,8 +1324,6 @@ static void migrate_fd_cleanup(MigrationState *s)
>      qemu_savevm_state_cleanup();
>  
>      if (s->to_dst_file) {
> -        QEMUFile *tmp;
> -
>          trace_migrate_fd_cleanup();
>          bql_unlock();
>          if (s->migration_thread_running) {
> @@ -1344,15 +1343,14 @@ static void migrate_fd_cleanup(MigrationState *s)
>           * critical section won't block for long.
>           */
>          migration_ioc_unregister_yank_from_file(tmp);
> -        qemu_fclose(tmp);
>      }
>  
> -    /*
> -     * We already cleaned up to_dst_file, so errors from the return
> -     * path might be due to that, ignore them.
> -     */
>      close_return_path_on_source(s, file_error);
>  
> +    if (tmp) {
> +        qemu_fclose(tmp);
> +    }
> +
>      assert(!migration_is_active(s));
>  
>      if (s->state == MIGRATION_STATUS_CANCELLING) {
Cédric Le Goater Feb. 2, 2024, 2:51 p.m. UTC | #2
On 2/2/24 15:42, Fabiano Rosas wrote:
> Cédric Le Goater <clg@redhat.com> writes:
> 
>> In case of error, close_return_path_on_source() can perform a shutdown
>> to exit the return-path thread.  However, in migrate_fd_cleanup(),
>> 'to_dst_file' is closed before calling close_return_path_on_source()
>> and the shutdown fails, leaving the source and destination waiting for
>> an event to occur.
> 
> At close_return_path_on_source, qemu_file_shutdown() and checking
> ms->to_dst_file are done under the qemu_file_lock, so how could
> migrate_fd_cleanup() have cleared the pointer but the ms->to_dst_file
> check have passed?

This is not a locking issue, it's much simpler. migrate_fd_cleanup()
clears the ms->to_dst_file pointer and closes the QEMUFile and then
calls close_return_path_on_source() which then tries to use resources
which are not available anymore.

Thanks,

C.
> 
>>
>> Close the file after calling close_return_path_on_source() so that the
>> shutdown succeeds and the return-path thread exits.
>>
>> Signed-off-by: Cédric Le Goater <clg@redhat.com>
>> ---
>>   migration/migration.c | 12 +++++-------
>>   1 file changed, 5 insertions(+), 7 deletions(-)
>>
>> diff --git a/migration/migration.c b/migration/migration.c
>> index 2c3362235c7651c11d581f3c3639571f1f9636ef..1e0b6acaedc272e8ce26ad40be2c42177f5fd14e 100644
>> --- a/migration/migration.c
>> +++ b/migration/migration.c
>> @@ -1314,6 +1314,7 @@ void migrate_set_state(int *state, int old_state, int new_state)
>>   static void migrate_fd_cleanup(MigrationState *s)
>>   {
>>       int file_error = 0;
>> +    QEMUFile *tmp = NULL;
>>   
>>       g_free(s->hostname);
>>       s->hostname = NULL;
>> @@ -1323,8 +1324,6 @@ static void migrate_fd_cleanup(MigrationState *s)
>>       qemu_savevm_state_cleanup();
>>   
>>       if (s->to_dst_file) {
>> -        QEMUFile *tmp;
>> -
>>           trace_migrate_fd_cleanup();
>>           bql_unlock();
>>           if (s->migration_thread_running) {
>> @@ -1344,15 +1343,14 @@ static void migrate_fd_cleanup(MigrationState *s)
>>            * critical section won't block for long.
>>            */
>>           migration_ioc_unregister_yank_from_file(tmp);
>> -        qemu_fclose(tmp);
>>       }
>>   
>> -    /*
>> -     * We already cleaned up to_dst_file, so errors from the return
>> -     * path might be due to that, ignore them.
>> -     */
>>       close_return_path_on_source(s, file_error);
>>   
>> +    if (tmp) {
>> +        qemu_fclose(tmp);
>> +    }
>> +
>>       assert(!migration_is_active(s));
>>   
>>       if (s->state == MIGRATION_STATUS_CANCELLING) {
>
Fabiano Rosas Feb. 2, 2024, 3:11 p.m. UTC | #3
Cédric Le Goater <clg@redhat.com> writes:

> On 2/2/24 15:42, Fabiano Rosas wrote:
>> Cédric Le Goater <clg@redhat.com> writes:
>> 
>>> In case of error, close_return_path_on_source() can perform a shutdown
>>> to exit the return-path thread.  However, in migrate_fd_cleanup(),
>>> 'to_dst_file' is closed before calling close_return_path_on_source()
>>> and the shutdown fails, leaving the source and destination waiting for
>>> an event to occur.
>> 
>> At close_return_path_on_source, qemu_file_shutdown() and checking
>> ms->to_dst_file are done under the qemu_file_lock, so how could
>> migrate_fd_cleanup() have cleared the pointer but the ms->to_dst_file
>> check have passed?
>
> This is not a locking issue, it's much simpler. migrate_fd_cleanup()
> clears the ms->to_dst_file pointer and closes the QEMUFile and then
> calls close_return_path_on_source() which then tries to use resources
> which are not available anymore.

I'm missing something here. Which resources? I assume you're talking
about this:

    WITH_QEMU_LOCK_GUARD(&ms->qemu_file_lock) {
        if (ms->to_dst_file && ms->rp_state.from_dst_file &&
            qemu_file_get_error(ms->to_dst_file)) {
            qemu_file_shutdown(ms->rp_state.from_dst_file);
        }
    }

How do we get past the 'if (ms->to_dst_file)'?
Peter Xu Feb. 5, 2024, 3:37 a.m. UTC | #4
On Fri, Feb 02, 2024 at 12:11:09PM -0300, Fabiano Rosas wrote:
> Cédric Le Goater <clg@redhat.com> writes:
> 
> > On 2/2/24 15:42, Fabiano Rosas wrote:
> >> Cédric Le Goater <clg@redhat.com> writes:
> >> 
> >>> In case of error, close_return_path_on_source() can perform a shutdown
> >>> to exit the return-path thread.  However, in migrate_fd_cleanup(),
> >>> 'to_dst_file' is closed before calling close_return_path_on_source()
> >>> and the shutdown fails, leaving the source and destination waiting for
> >>> an event to occur.
> >> 
> >> At close_return_path_on_source, qemu_file_shutdown() and checking
> >> ms->to_dst_file are done under the qemu_file_lock, so how could
> >> migrate_fd_cleanup() have cleared the pointer but the ms->to_dst_file
> >> check have passed?
> >
> > This is not a locking issue, it's much simpler. migrate_fd_cleanup()
> > clears the ms->to_dst_file pointer and closes the QEMUFile and then
> > calls close_return_path_on_source() which then tries to use resources
> > which are not available anymore.
> 
> I'm missing something here. Which resources? I assume you're talking
> about this:
> 
>     WITH_QEMU_LOCK_GUARD(&ms->qemu_file_lock) {
>         if (ms->to_dst_file && ms->rp_state.from_dst_file &&
>             qemu_file_get_error(ms->to_dst_file)) {
>             qemu_file_shutdown(ms->rp_state.from_dst_file);
>         }
>     }
> 
> How do we get past the 'if (ms->to_dst_file)'?

We don't; migrate_fd_cleanup() releases ms->to_dst_file, then calls
close_return_path_on_source(), which finds that to_dst_file==NULL and skips
the shutdown().

One other option might be that we do close_return_path_on_source() before
the chunk of releasing to_dst_file.
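
A minimal sketch of the two orderings discussed above, assuming that releasing 'to_dst_file' amounts to clearing the pointer, as described in this message. `ToyFile`, `ToyState` and the `toy_*` functions are hypothetical, and locking is omitted. Only the shape of the guard is taken from the snippet quoted in the thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins; NOT the real QEMUFile or MigrationState. */
typedef struct ToyFile {
    int error;       /* non-zero once the stream has failed */
    bool shut_down;  /* set when a shutdown was issued on this file */
} ToyFile;

typedef struct ToyState {
    ToyFile *to_dst_file;    /* outgoing stream */
    ToyFile *from_dst_file;  /* return path */
} ToyState;

/* Same shape as the guarded check quoted in the thread (lock omitted). */
static void toy_close_return_path(ToyState *s)
{
    assert(s);
    if (s->to_dst_file && s->from_dst_file && s->to_dst_file->error) {
        s->from_dst_file->shut_down = true;
    }
}

/* Assumption: releasing the outgoing file clears the pointer in the state. */
static void toy_release_to_dst_file(ToyState *s)
{
    assert(s);
    s->to_dst_file = NULL;
}

/* Current order: release first, so the guard sees NULL and skips the shutdown. */
static void toy_cleanup_release_first(ToyState *s)
{
    toy_release_to_dst_file(s);
    toy_close_return_path(s);
}

/* Alternative order: run the return-path close before releasing the file. */
static void toy_cleanup_return_path_first(ToyState *s)
{
    toy_close_return_path(s);
    toy_release_to_dst_file(s);
}
```

With an error recorded on the outgoing file, `toy_cleanup_release_first()` leaves the return path untouched, whereas `toy_cleanup_return_path_first()` marks it as shut down before the pointer is cleared.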

This "two qemufiles share the same ioc" issue has bitten us before IIRC,
and the only concern with that workaround is that we keep postponing
resolution of the real issue and keep getting bitten by it.

Maybe we can wait a few days to see if Dan can join the conversation and if
we can reach a consensus on a complete solution.  Otherwise I think we can
still work this around, but maybe that'll require a comment block
explaining the bits after such movement.

Thanks,
Cédric Le Goater Feb. 5, 2024, 10:17 a.m. UTC | #5
On 2/5/24 04:37, Peter Xu wrote:
> On Fri, Feb 02, 2024 at 12:11:09PM -0300, Fabiano Rosas wrote:
>> Cédric Le Goater <clg@redhat.com> writes:
>>
>>> On 2/2/24 15:42, Fabiano Rosas wrote:
>>>> Cédric Le Goater <clg@redhat.com> writes:
>>>>
>>>>> In case of error, close_return_path_on_source() can perform a shutdown
>>>>> to exit the return-path thread.  However, in migrate_fd_cleanup(),
>>>>> 'to_dst_file' is closed before calling close_return_path_on_source()
>>>>> and the shutdown fails, leaving the source and destination waiting for
>>>>> an event to occur.
>>>>
>>>> At close_return_path_on_source, qemu_file_shutdown() and checking
>>>> ms->to_dst_file are done under the qemu_file_lock, so how could
>>>> migrate_fd_cleanup() have cleared the pointer but the ms->to_dst_file
>>>> check have passed?
>>>
>>> This is not a locking issue, it's much simpler. migrate_fd_cleanup()
>>> clears the ms->to_dst_file pointer and closes the QEMUFile and then
>>> calls close_return_path_on_source() which then tries to use resources
>>> which are not available anymore.
>>
>> I'm missing something here. Which resources? I assume you're talking
>> about this:
>>
>>      WITH_QEMU_LOCK_GUARD(&ms->qemu_file_lock) {
>>          if (ms->to_dst_file && ms->rp_state.from_dst_file &&
>>              qemu_file_get_error(ms->to_dst_file)) {
>>              qemu_file_shutdown(ms->rp_state.from_dst_file);
>>          }
>>      }
>>
>> How do we get past the 'if (ms->to_dst_file)'?
> 
> We don't; migrate_fd_cleanup() releases ms->to_dst_file, then calls
> close_return_path_on_source(), which finds that to_dst_file==NULL and skips
> the shutdown().
> 
> One other option might be that we do close_return_path_on_source() before
> the chunk of releasing to_dst_file.
> 
> This "two qemufiles share the same ioc" issue has bitten us before IIRC,
> and the only concern with that workaround is that we keep postponing
> resolution of the real issue and keep getting bitten by it.
> 
> Maybe we can wait a few days to see if Dan can join the conversation and if
> we can reach a consensus on a complete solution.  Otherwise I think we can
> still work this around, but maybe that'll require a comment block
> explaining the bits after such movement.

Yes. The series should have been sent as an RFC.

I changed PATCH 1 to use migrate_has_error() instead of
qemu_file_get_error(ms->to_dst_file). I will keep PATCH 2 as it is for
the time being and wait for more feedback.

The prereq series adds an Error** argument to the .save_setup() and
.log_global*() handlers. I should send it this week.

Thanks,

C.
> 
> Thanks,
>
Patch

diff --git a/migration/migration.c b/migration/migration.c
index 2c3362235c7651c11d581f3c3639571f1f9636ef..1e0b6acaedc272e8ce26ad40be2c42177f5fd14e 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -1314,6 +1314,7 @@  void migrate_set_state(int *state, int old_state, int new_state)
 static void migrate_fd_cleanup(MigrationState *s)
 {
     int file_error = 0;
+    QEMUFile *tmp = NULL;
 
     g_free(s->hostname);
     s->hostname = NULL;
@@ -1323,8 +1324,6 @@  static void migrate_fd_cleanup(MigrationState *s)
     qemu_savevm_state_cleanup();
 
     if (s->to_dst_file) {
-        QEMUFile *tmp;
-
         trace_migrate_fd_cleanup();
         bql_unlock();
         if (s->migration_thread_running) {
@@ -1344,15 +1343,14 @@  static void migrate_fd_cleanup(MigrationState *s)
          * critical section won't block for long.
          */
         migration_ioc_unregister_yank_from_file(tmp);
-        qemu_fclose(tmp);
     }
 
-    /*
-     * We already cleaned up to_dst_file, so errors from the return
-     * path might be due to that, ignore them.
-     */
     close_return_path_on_source(s, file_error);
 
+    if (tmp) {
+        qemu_fclose(tmp);
+    }
+
     assert(!migration_is_active(s));
 
     if (s->state == MIGRATION_STATUS_CANCELLING) {