[00/14] migration/multifd: Refactor ->send_prepare() and cleanups

Message ID 20240131103111.306523-1-peterx@redhat.com

Peter Xu Jan. 31, 2024, 10:30 a.m. UTC
From: Peter Xu <peterx@redhat.com>

This patchset contains quite a few refactorings to current multifd:

  - It picked up some patches from an old series of mine [0] (the last
    patches of that series were dropped, though; I did the cleanup slightly
    differently).  I still managed to include one patch to split
    pending_job, but I rewrote that patch here.

  - It cleans up multiple multifd paths here and there; the ultimate
    goal is to redefine send_prepare() to be something like:

      p->pages ----------->  send_prepare() -------------> IOVs

    With that, there is no other change to multifd_ops yet besides the
    redefined send_prepare() interface.  We may want a separate set of ops
    for file migration later.

For the second point, one benefit was already presented by Fabiano in his
other series [1] on cleaning up zero copy, but this patchset addresses it
quite differently, and hopefully also more gradually.  The other benefit is
that, once we have a more concrete API for send_prepare() and can reach an
initial consensus on it, the recent compression accelerators can be rebased
on top of this series.

This also prepares for the case where the input is extended beyond
p->pages to arbitrary data (like VFIO's potential use case in the
future?).  But that is also left for later, even if it turns out to be
reasonable.

Please have a look.  Thanks,

[0] https://lore.kernel.org/r/20231022201211.452861-1-peterx@redhat.com
[1] https://lore.kernel.org/qemu-devel/20240126221943.26628-1-farosas@suse.de

Peter Xu (14):
  migration/multifd: Drop stale comment for multifd zero copy
  migration/multifd: multifd_send_kick_main()
  migration/multifd: Drop MultiFDSendParams.quit, cleanup error paths
  migration/multifd: Postpone reset of MultiFDPages_t
  migration/multifd: Drop MultiFDSendParams.normal[] array
  migration/multifd: Separate SYNC request with normal jobs
  migration/multifd: Simplify locking in sender thread
  migration/multifd: Drop pages->num check in sender thread
  migration/multifd: Rename p->num_packets and clean it up
  migration/multifd: Move total_normal_pages accounting
  migration/multifd: Move trace_multifd_send|recv()
  migration/multifd: multifd_send_prepare_header()
  migration/multifd: Move header prepare/fill into send_prepare()
  migration/multifd: Forbid spurious wakeups

 migration/multifd.h      |  34 +++--
 migration/multifd-zlib.c |  11 +-
 migration/multifd-zstd.c |  11 +-
 migration/multifd.c      | 291 +++++++++++++++++++--------------------
 4 files changed, 182 insertions(+), 165 deletions(-)

Comments

Fabiano Rosas Jan. 31, 2024, 10:49 p.m. UTC | #1
peterx@redhat.com writes:

> [...]

This series didn't survive my 9999 iterations test on the opensuse
machine.

# Running /x86_64/migration/multifd/tcp/tls/x509/reject-anon-client
...
kill_qemu() detected QEMU death from signal 11 (Segmentation fault) (core dumped)


#0  0x00005575dda06399 in qemu_mutex_lock_impl (mutex=0x18, file=0x5575ddce9cc3 "../util/qemu-thread-posix.c", line=275) at ../util/qemu-thread-posix.c:92
#1  0x00005575dda06a94 in qemu_sem_post (sem=0x18) at ../util/qemu-thread-posix.c:275
#2  0x00005575dd56a512 in multifd_send_thread (opaque=0x5575df054ef8) at ../migration/multifd.c:720
#3  0x00005575dda0709b in qemu_thread_start (args=0x7fd404001d50) at ../util/qemu-thread-posix.c:541
#4  0x00007fd45e8a26ea in start_thread (arg=0x7fd3faffd700) at pthread_create.c:477
#5  0x00007fd45cd2150f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

The multifd thread is posting channels_ready with an already freed
multifd_send_state.

This is the bug Avihai has hit. We're going into multifd_save_cleanup()
so early that multifd_new_send_channel_async() hasn't even had the
chance to set p->running. So it misses the join and frees everything up
while a second multifd thread is just starting.
Peter Xu Feb. 1, 2024, 5:47 a.m. UTC | #2
On Wed, Jan 31, 2024 at 07:49:51PM -0300, Fabiano Rosas wrote:
> peterx@redhat.com writes:
> 
> > [...]
> 
> The multifd thread is posting channels_ready with an already freed
> multifd_send_state.
> 
> This is the bug Avihai has hit. We're going into multifd_save_cleanup()
> so early that multifd_new_send_channel_async() hasn't even had the
> chance to set p->running. So it misses the join and frees everything up
> while a second multifd thread is just starting.

Thanks for doing that.

Would this series make that bug easier to happen?  I didn't do a lot of
testing on it; it only survived the smoke test and the kicked CI job.  I
think we can still decide to fix that issue separately; but if this series
makes it easier to happen, then that's definitely bad.
Avihai Horon Feb. 1, 2024, 12:51 p.m. UTC | #3
On 01/02/2024 7:47, Peter Xu wrote:
>
> On Wed, Jan 31, 2024 at 07:49:51PM -0300, Fabiano Rosas wrote:
>> peterx@redhat.com writes:
>>
>>> [...]
>>
>> The multifd thread is posting channels_ready with an already freed
>> multifd_send_state.
>>
>> This is the bug Avihai has hit. We're going into multifd_save_cleanup()
>> so early that multifd_new_send_channel_async() hasn't even had the
>> chance to set p->running. So it misses the join and frees everything up
>> while a second multifd thread is just starting.
> Thanks for doing that.
>
> Would this series makes that bug easier to happen?

I think so.
Patch #3 added an extra multifd_send_should_exit() check in
multifd_send_sync_main(), so now it can exit early if the first channel
fails.
Plus, migration state is now set to FAILED early by
multifd_new_send_channel_async()->multifd_send_terminate_threads() and
multifd_tls_outgoing_handshake()->multifd_send_terminate_threads(),
so migration_iteration_run() is completely skipped, because the
migration_is_active() check before it will return false.

I *think* this is what makes the main migration thread finish earlier and
call multifd_save_cleanup() earlier, at least for me.

Fabiano Rosas Feb. 1, 2024, 9:46 p.m. UTC | #4
Avihai Horon <avihaih@nvidia.com> writes:

> On 01/02/2024 7:47, Peter Xu wrote:
>> [...]
>> Thanks for doing that.
>>
>> Would this series makes that bug easier to happen?
>
> I think so.
> Patch #3 added an extra multifd_send_should_exit() check in 
> multifd_send_sync_main(), so now it can exit early if the first channel 
> fails.
> Plus, now migration state is set to FAILED early by:
> multifd_new_send_channel_async()->multifd_send_terminate_threads() and 
> multifd_tls_outgoing_handshake()->multifd_send_terminate_threads()
> so migration_iteration_run() is completely skipped because 
> migration_is_active() check before it will return false.
>
> I *think* this is what makes main migration thread finish earlier and 
> call multifd_save_cleanup() earlier, at least for me.
>

I'm doing some experiments with a global semaphore like channels_ready
instead of a per-channel structure like you suggested. I think we only
need to have a point past which we're assured no more channels will be
created. With that we'd only need one post at
multifd_new_send_channel_async.
Peter Xu Feb. 2, 2024, 2:12 a.m. UTC | #5
On Thu, Feb 01, 2024 at 06:46:35PM -0300, Fabiano Rosas wrote:
> Avihai Horon <avihaih@nvidia.com> writes:
> 
> > On 01/02/2024 7:47, Peter Xu wrote:
> >> [...]
> >> Thanks for doing that.
> >>
> >> Would this series makes that bug easier to happen?
> >
> > I think so.
> > Patch #3 added an extra multifd_send_should_exit() check in 
> > multifd_send_sync_main(), so now it can exit early if the first channel 
> > fails.
> > Plus, now migration state is set to FAILED early by:
> > multifd_new_send_channel_async()->multifd_send_terminate_threads() and 
> > multifd_tls_outgoing_handshake()->multifd_send_terminate_threads()
> > so migration_iteration_run() is completely skipped because 
> > migration_is_active() check before it will return false.
> >
> > I *think* this is what makes main migration thread finish earlier and 
> > call multifd_save_cleanup() earlier, at least for me.
> >
> 
> I'm doing some experiments with a global semaphore like channels_ready
> instead of a per-channel structure like you suggested. I think we only
> need to have a point past which we're assured no more channels will be
> created. With that we'd only need one post at
> multifd_new_send_channel_async.

Fabiano, Avihai,

If this series does not drastically make things worse, I would leave that
issue alone for now and move on with reposting this series, in the hope
that we still have time to address the bug in 9.0 (the issue has existed
for much longer anyway).  I do plan to merge this series early if possible,
assuming that will make it easier for the accelerator projects to rebase on
top.

If I missed something, please feel free to reply again in v2.

Thanks,