
[v2,2/3] Add zero-copy-copied migration stat

Message ID 20220701155935.482503-3-leobras@redhat.com
State New
Series: Zero copy improvements (QIOChannel + multifd)

Commit Message

Leonardo Bras July 1, 2022, 3:59 p.m. UTC
Signed-off-by: Leonardo Bras <leobras@redhat.com>
---
 qapi/migration.json   | 5 ++++-
 migration/migration.c | 1 +
 monitor/hmp-cmds.c    | 4 ++++
 3 files changed, 9 insertions(+), 1 deletion(-)

Comments

Markus Armbruster July 4, 2022, 6:18 a.m. UTC | #1
Leonardo Bras <leobras@redhat.com> writes:

> Signed-off-by: Leonardo Bras <leobras@redhat.com>
> ---
>  qapi/migration.json   | 5 ++++-
>  migration/migration.c | 1 +
>  monitor/hmp-cmds.c    | 4 ++++
>  3 files changed, 9 insertions(+), 1 deletion(-)
>
> diff --git a/qapi/migration.json b/qapi/migration.json
> index 7102e474a6..925f009868 100644
> --- a/qapi/migration.json
> +++ b/qapi/migration.json
> @@ -55,6 +55,9 @@
>  # @postcopy-bytes: The number of bytes sent during the post-copy phase
>  #                  (since 7.0).
>  #
> +# @zero-copy-copied: The number of zero-copy flushes that reported data sent
> +#                    using zero-copy that ended up being copied. (since 7.2)

The description feels awkward.  What's a "zero-copy flush", and why
should the user care?  I figure what users care about is the number of
all-zero pages we had to "copy", i.e. send the bulky way.  Is this what
@zero-copy-copied reports?

> +#
>  # Since: 0.14
>  ##
>  { 'struct': 'MigrationStats',
> @@ -65,7 +68,7 @@
>             'postcopy-requests' : 'int', 'page-size' : 'int',
>             'multifd-bytes' : 'uint64', 'pages-per-second' : 'uint64',
>             'precopy-bytes' : 'uint64', 'downtime-bytes' : 'uint64',
> -           'postcopy-bytes' : 'uint64' } }
> +           'postcopy-bytes' : 'uint64', 'zero-copy-copied' : 'uint64' } }
>  
>  ##
>  # @XBZRLECacheStats:
Daniel P. Berrangé July 4, 2022, 9:06 a.m. UTC | #2
On Mon, Jul 04, 2022 at 08:18:54AM +0200, Markus Armbruster wrote:
> Leonardo Bras <leobras@redhat.com> writes:
> 
> > Signed-off-by: Leonardo Bras <leobras@redhat.com>
> > ---
> >  qapi/migration.json   | 5 ++++-
> >  migration/migration.c | 1 +
> >  monitor/hmp-cmds.c    | 4 ++++
> >  3 files changed, 9 insertions(+), 1 deletion(-)
> >
> > diff --git a/qapi/migration.json b/qapi/migration.json
> > index 7102e474a6..925f009868 100644
> > --- a/qapi/migration.json
> > +++ b/qapi/migration.json
> > @@ -55,6 +55,9 @@
> >  # @postcopy-bytes: The number of bytes sent during the post-copy phase
> >  #                  (since 7.0).
> >  #
> > +# @zero-copy-copied: The number of zero-copy flushes that reported data sent
> > +#                    using zero-copy that ended up being copied. (since 7.2)
> 
> The description feels awkward.  What's a "zero-copy flush", and why
> should the user care?  I figure what users care about is the number of
> all-zero pages we had to "copy", i.e. send the bulky way.  Is this what
> @zero-copy-copied reports?

MigrationCapability field @zero-copy-send instructs QEMU to try to
avoid copying data between userspace and kernel space when transmitting
RAM region.

Even if the kernel supports zero copy, it is not guaranteed to happen,
it is merely a request to try.

QEMU periodically (once per migration iteration) flushes outstanding
zero-copy requests and gets an indication back of whether any copies
took place or not.

So this counter is a reflection of how many iterations resulted in
zero-copy not being fully honoured.

IOW, ideally this counter will always be zero. If it is non-zero,
then the magnitude gives a very very very rough guide to what's
going on. If it is '1' then it was just a transient limitation.
If it matches the number of migration iterations, then it is a
more systemic limitation.

Incidentally, do we report the migration iteration count? I
thought we did, but I'm not finding it now that I look.


With regards,
Daniel
Markus Armbruster July 4, 2022, 11:33 a.m. UTC | #3
Daniel P. Berrangé <berrange@redhat.com> writes:

> On Mon, Jul 04, 2022 at 08:18:54AM +0200, Markus Armbruster wrote:
>> Leonardo Bras <leobras@redhat.com> writes:
>> 
>> > Signed-off-by: Leonardo Bras <leobras@redhat.com>
>> > ---
>> >  qapi/migration.json   | 5 ++++-
>> >  migration/migration.c | 1 +
>> >  monitor/hmp-cmds.c    | 4 ++++
>> >  3 files changed, 9 insertions(+), 1 deletion(-)
>> >
>> > diff --git a/qapi/migration.json b/qapi/migration.json
>> > index 7102e474a6..925f009868 100644
>> > --- a/qapi/migration.json
>> > +++ b/qapi/migration.json
>> > @@ -55,6 +55,9 @@
>> >  # @postcopy-bytes: The number of bytes sent during the post-copy phase
>> >  #                  (since 7.0).
>> >  #
>> > +# @zero-copy-copied: The number of zero-copy flushes that reported data sent
>> > +#                    using zero-copy that ended up being copied. (since 7.2)
>> 
>> The description feels awkward.  What's a "zero-copy flush", and why
>> should the user care?  I figure what users care about is the number of
>> all-zero pages we had to "copy", i.e. send the bulky way.  Is this what
>> @zero-copy-copied reports?
>
> MigrationCapability field @zero-copy-send instructs QEMU to try to
> avoid copying data between userspace and kernel space when transmitting
> RAM region.
>
> Even if the kernel supports zero copy, it is not guaranteed to happen,
> it is merely a request to try.
>
> QEMU periodically (once per migration iteration) flushes outstanding
> zero-copy requests and gets an indication back of whether any copies
> took place or not.
>
> So this counter is a reflection of how many iterations resulted  in
> zero-copy not being fully honoured.

Aha.  Thanks!

> IOW, ideally this counter will always be zero. If it is non-zero,
> then the magnitude gives a very very very rough guide to what's
> going on. If it is '1' then it was just a transient limitation.
> If it matches the number of migration iterations, then it is a
> more systemic limitation.

I think we should rephrase the documentation in terms of migration
iterations instead of "flushes", because the latter is a non-obvious
term without a definition.

> Incidentally, do we report the migration iteration count ? I
> thought we did, but i'm not finding it now that I look.

I was about to ask exactly that.  I'm not sure the value of
@zero-copy-copied can be usefully interpreted without.
Dr. David Alan Gilbert July 4, 2022, 11:40 a.m. UTC | #4
* Daniel P. Berrangé (berrange@redhat.com) wrote:
> On Mon, Jul 04, 2022 at 08:18:54AM +0200, Markus Armbruster wrote:
> > Leonardo Bras <leobras@redhat.com> writes:
> > 
> > > Signed-off-by: Leonardo Bras <leobras@redhat.com>
> > > ---
> > >  qapi/migration.json   | 5 ++++-
> > >  migration/migration.c | 1 +
> > >  monitor/hmp-cmds.c    | 4 ++++
> > >  3 files changed, 9 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/qapi/migration.json b/qapi/migration.json
> > > index 7102e474a6..925f009868 100644
> > > --- a/qapi/migration.json
> > > +++ b/qapi/migration.json
> > > @@ -55,6 +55,9 @@
> > >  # @postcopy-bytes: The number of bytes sent during the post-copy phase
> > >  #                  (since 7.0).
> > >  #
> > > +# @zero-copy-copied: The number of zero-copy flushes that reported data sent
> > > +#                    using zero-copy that ended up being copied. (since 7.2)
> > 
> > The description feels awkward.  What's a "zero-copy flush", and why
> > should the user care?  I figure what users care about is the number of
> > all-zero pages we had to "copy", i.e. send the bulky way.  Is this what
> > @zero-copy-copied reports?
> 
> MigrationCapability field @zero-copy-send instructs QEMU to try to
> avoid copying data between userspace and kernel space when transmitting
> RAM region.
> 
> Even if the kernel supports zero copy, it is not guaranteed to happen,
> it is merely a request to try.
> 
> QEMU periodically (once per migration iteration) flushes outstanding
> zero-copy requests and gets an indication back of whether any copies
> took place or not.
> 
> So this counter is a reflection of how many iterations resulted  in
> zero-copy not being fully honoured.
> 
> IOW, ideally this counter will always be zero. If it is non-zero,
> then the magnitude gives a very very very rough guide to what's
> going on. If it is '1' then it was just a transient limitation.
> If it matches the number of migration iterations, then it is a
> more systemic limitation.
> 
> Incidentally, do we report the migration iteration count ? I
> thought we did, but i'm not finding it now that I look.

Yes we do; it's dirty-sync-count

Dave

> 
> With regards,
> Daniel
> -- 
> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
>
Markus Armbruster July 4, 2022, 12:04 p.m. UTC | #5
"Dr. David Alan Gilbert" <dgilbert@redhat.com> writes:

> * Daniel P. Berrangé (berrange@redhat.com) wrote:
>> On Mon, Jul 04, 2022 at 08:18:54AM +0200, Markus Armbruster wrote:
>> > Leonardo Bras <leobras@redhat.com> writes:
>> > 
>> > > Signed-off-by: Leonardo Bras <leobras@redhat.com>
>> > > ---
>> > >  qapi/migration.json   | 5 ++++-
>> > >  migration/migration.c | 1 +
>> > >  monitor/hmp-cmds.c    | 4 ++++
>> > >  3 files changed, 9 insertions(+), 1 deletion(-)
>> > >
>> > > diff --git a/qapi/migration.json b/qapi/migration.json
>> > > index 7102e474a6..925f009868 100644
>> > > --- a/qapi/migration.json
>> > > +++ b/qapi/migration.json
>> > > @@ -55,6 +55,9 @@
>> > >  # @postcopy-bytes: The number of bytes sent during the post-copy phase
>> > >  #                  (since 7.0).
>> > >  #
>> > > +# @zero-copy-copied: The number of zero-copy flushes that reported data sent
>> > > +#                    using zero-copy that ended up being copied. (since 7.2)

since 7.1, unless you're planning for really tortuous review :)

>> > 
>> > The description feels awkward.  What's a "zero-copy flush", and why
>> > should the user care?  I figure what users care about is the number of
>> > all-zero pages we had to "copy", i.e. send the bulky way.  Is this what
>> > @zero-copy-copied reports?
>> 
>> MigrationCapability field @zero-copy-send instructs QEMU to try to
>> avoid copying data between userspace and kernel space when transmitting
>> RAM region.
>> 
>> Even if the kernel supports zero copy, it is not guaranteed to happen,
>> it is merely a request to try.
>> 
>> QEMU periodically (once per migration iteration) flushes outstanding
>> zero-copy requests and gets an indication back of whether any copies
>> took place or not.
>> 
>> So this counter is a reflection of how many iterations resulted  in
>> zero-copy not being fully honoured.
>> 
>> IOW, ideally this counter will always be zero. If it is non-zero,
>> then the magnitude gives a very very very rough guide to what's
>> going on. If it is '1' then it was just a transient limitation.
>> If it matches the number of migration iterations, then it is a
>> more systemic limitation.
>> 
>> Incidentally, do we report the migration iteration count ? I
>> thought we did, but i'm not finding it now that I look.
>
> Yes we do; it's dirty-sync-count

Please rephrase the documentation of @zero-copy-copied in terms of
@dirty-sync-count.  Here's my attempt.

# @zero-copy-copied: Number of times dirty RAM synchronization could not
#                    avoid copying zero pages.  This is between 0 and
#                    @dirty-sync-count.  (since 7.1)
Daniel P. Berrangé July 4, 2022, 12:16 p.m. UTC | #6
On Mon, Jul 04, 2022 at 02:04:41PM +0200, Markus Armbruster wrote:
> "Dr. David Alan Gilbert" <dgilbert@redhat.com> writes:
> 
> > * Daniel P. Berrangé (berrange@redhat.com) wrote:
> >> On Mon, Jul 04, 2022 at 08:18:54AM +0200, Markus Armbruster wrote:
> >> > Leonardo Bras <leobras@redhat.com> writes:
> >> > 
> >> > > Signed-off-by: Leonardo Bras <leobras@redhat.com>
> >> > > ---
> >> > >  qapi/migration.json   | 5 ++++-
> >> > >  migration/migration.c | 1 +
> >> > >  monitor/hmp-cmds.c    | 4 ++++
> >> > >  3 files changed, 9 insertions(+), 1 deletion(-)
> >> > >
> >> > > diff --git a/qapi/migration.json b/qapi/migration.json
> >> > > index 7102e474a6..925f009868 100644
> >> > > --- a/qapi/migration.json
> >> > > +++ b/qapi/migration.json
> >> > > @@ -55,6 +55,9 @@
> >> > >  # @postcopy-bytes: The number of bytes sent during the post-copy phase
> >> > >  #                  (since 7.0).
> >> > >  #
> >> > > +# @zero-copy-copied: The number of zero-copy flushes that reported data sent
> >> > > +#                    using zero-copy that ended up being copied. (since 7.2)
> 
> since 7.1, unless you're planning for really tortuous review :)
> 
> >> > 
> >> > The description feels awkward.  What's a "zero-copy flush", and why
> >> > should the user care?  I figure what users care about is the number of
> >> > all-zero pages we had to "copy", i.e. send the bulky way.  Is this what
> >> > @zero-copy-copied reports?
> >> 
> >> MigrationCapability field @zero-copy-send instructs QEMU to try to
> >> avoid copying data between userspace and kernel space when transmitting
> >> RAM region.
> >> 
> >> Even if the kernel supports zero copy, it is not guaranteed to happen,
> >> it is merely a request to try.
> >> 
> >> QEMU periodically (once per migration iteration) flushes outstanding
> >> zero-copy requests and gets an indication back of whether any copies
> >> took place or not.
> >> 
> >> So this counter is a reflection of how many iterations resulted  in
> >> zero-copy not being fully honoured.
> >> 
> >> IOW, ideally this counter will always be zero. If it is non-zero,
> >> then the magnitude gives a very very very rough guide to what's
> >> going on. If it is '1' then it was just a transient limitation.
> >> If it matches the number of migration iterations, then it is a
> >> more systemic limitation.
> >> 
> >> Incidentally, do we report the migration iteration count ? I
> >> thought we did, but i'm not finding it now that I look.
> >
> > Yes we do; it's dirty-sync-count
> 
> Please rephrase the documentation of @zero-copy-copied in terms of
> @dirty-sync-count.  Here's my attempt.
> 
> # @zero-copy-copied: Number of times dirty RAM synchronization could not
> #                    avoid copying zero pages.  This is between 0 and
> #                    @dirty-sync-count.  (since 7.1)

Anyone have preferences on the name? I get slightly put off by the
repeated word in the property name here.

   @zero-copy-rejects ?
   @zero-copy-misses ?
   @zero-copy-fails ?



With regards,
Daniel
Markus Armbruster July 4, 2022, 12:59 p.m. UTC | #7
Daniel P. Berrangé <berrange@redhat.com> writes:

> On Mon, Jul 04, 2022 at 02:04:41PM +0200, Markus Armbruster wrote:

[...]

>> Please rephrase the documentation of @zero-copy-copied in terms of
>> @dirty-sync-count.  Here's my attempt.
>> 
>> # @zero-copy-copied: Number of times dirty RAM synchronization could not
>> #                    avoid copying zero pages.  This is between 0 and
>> #                    @dirty-sync-count.  (since 7.1)
>
> Any one have preferences on the name - i get slight put off by the
> repeated word in the property name here.
>
>    @zero-copy-rejects ?
>    @zero-copy-misses ?
>    @zero-copy-fails ?

I'd consider any of these an improvement.  Not a native speaker, but
perhaps "failures" instead of "fails".

We could also express the connection to @dirty-sync-count right in the
name, like @dirty-sync-rejected-zero-copy, @dirty-sync-missed-zero-copy,
@dirty-sync-failed-zero-copy.  Or maybe -copies.
Daniel P. Berrangé July 4, 2022, 1:14 p.m. UTC | #8
On Mon, Jul 04, 2022 at 02:59:50PM +0200, Markus Armbruster wrote:
> Daniel P. Berrangé <berrange@redhat.com> writes:
> 
> > On Mon, Jul 04, 2022 at 02:04:41PM +0200, Markus Armbruster wrote:
> 
> [...]
> 
> >> Please rephrase the documentation of @zero-copy-copied in terms of
> >> @dirty-sync-count.  Here's my attempt.
> >> 
> >> # @zero-copy-copied: Number of times dirty RAM synchronization could not
> >> #                    avoid copying zero pages.  This is between 0 and
> >> #                    @dirty-sync-count.  (since 7.1)
> >
> > Any one have preferences on the name - i get slight put off by the
> > repeated word in the property name here.
> >
> >    @zero-copy-rejects ?
> >    @zero-copy-misses ?
> >    @zero-copy-fails ?
> 
> I'd consider any of these an improvement.  Not a native speaker, but
> perhaps "failures" instead of "fails".
> 
> We could also express the connection to @dirty-sync-count right in the
> name, like @dirty-sync-rejected-zero-copy, @dirty-sync-missed-zero-copy,
> @dirty-sync-failed-zero-copy.  Or maybe -copies.

Yeah, @dirty-sync-missed-zero-copy is probably my favourite.

With regards,
Daniel
Leonardo Bras July 4, 2022, 6:13 p.m. UTC | #9
Thanks Daniel, Markus and Dave for the feedback!

On Mon, 2022-07-04 at 14:14 +0100, Daniel P. Berrangé wrote:
> On Mon, Jul 04, 2022 at 02:59:50PM +0200, Markus Armbruster wrote:
> > Daniel P. Berrangé <berrange@redhat.com> writes:
> > 
> > > On Mon, Jul 04, 2022 at 02:04:41PM +0200, Markus Armbruster wrote:
> > 
[...]

> 
> since 7.1, unless you're planning for really tortuous review :)
> 

Ok, updated :)

> > [...]
> > 
> > > > Please rephrase the documentation of @zero-copy-copied in terms of
> > > > @dirty-sync-count.  Here's my attempt.
> > > > 
> > > > # @zero-copy-copied: Number of times dirty RAM synchronization could not
> > > > #                    avoid copying zero pages.  This is between 0 and
> > > > #                    @dirty-sync-count.  (since 7.1)

That's a great description! And it's almost 100% correct.

IIUC dirty-sync-count increments on migration_bitmap_sync() once after each full
dirty-bitmap scan, and even with multifd syncing at the same point, it could
potentially have an increment per multifd channel.

The only change would be having:
# This is between 0 and @dirty-sync-count * @multifd-channels.


> > > 
> > > Any one have preferences on the name - i get slight put off by the
> > > repeated word in the property name here.
> > > 
> > >    @zero-copy-rejects ?
> > >    @zero-copy-misses ?
> > >    @zero-copy-fails ?
> > 
> > I'd consider any of these an improvement.  Not a native speaker, but
> > perhaps "failures" instead of "fails".
> > 
> > We could also express the connection to @dirty-sync-count right in the
> > name, like @dirty-sync-rejected-zero-copy, @dirty-sync-missed-zero-copy,
> > @dirty-sync-failed-zero-copy.  Or maybe -copies.
> 
> Yeah, @dirty-sync-missed-zero-copy   is probably my favourite.

Ok then, this one it is :)

I will update my patchset with the suggestions, and I should publish a v3
shortly. 

Best regards,
Leo

> 
> With regards,
> Daniel
Markus Armbruster July 5, 2022, 4:14 a.m. UTC | #10
Leonardo Brás <leobras@redhat.com> writes:

> Thanks Daniel, Markus and Dave for the feedback!
>
> On Mon, 2022-07-04 at 14:14 +0100, Daniel P. Berrangé wrote:
>> On Mon, Jul 04, 2022 at 02:59:50PM +0200, Markus Armbruster wrote:
>> > Daniel P. Berrangé <berrange@redhat.com> writes:
>> > 
>> > > On Mon, Jul 04, 2022 at 02:04:41PM +0200, Markus Armbruster wrote:
>> > 
> [...]
>
>> 
>> since 7.1, unless you're planning for really tortuous review :)
>> 
>
> Ok, updated :)
>
>> > [...]
>> > 
>> > > > Please rephrase the documentation of @zero-copy-copied in terms of
>> > > > @dirty-sync-count.  Here's my attempt.
>> > > > 
>> > > > # @zero-copy-copied: Number of times dirty RAM synchronization could not
>> > > > #                    avoid copying zero pages.  This is between 0 and
>> > > > #                    @dirty-sync-count.  (since 7.1)
>
> That's a great description! And it's almost 100% correct.

That's why we do patch review :)

> IIUC dirty-sync-count increments on migration_bitmap_sync() once after each full
> dirty-bitmap scan, and even with multifd syncing at the same point, it could
> potentially have a increment per multifd channel.
>
> The only change would be having:
> # This is between 0 and @dirty-sync-count * @multifd-channel.

Makes sense to me.

[...]

Patch

diff --git a/qapi/migration.json b/qapi/migration.json
index 7102e474a6..925f009868 100644
--- a/qapi/migration.json
+++ b/qapi/migration.json
@@ -55,6 +55,9 @@ 
 # @postcopy-bytes: The number of bytes sent during the post-copy phase
 #                  (since 7.0).
 #
+# @zero-copy-copied: The number of zero-copy flushes that reported data sent
+#                    using zero-copy that ended up being copied. (since 7.2)
+#
 # Since: 0.14
 ##
 { 'struct': 'MigrationStats',
@@ -65,7 +68,7 @@ 
            'postcopy-requests' : 'int', 'page-size' : 'int',
            'multifd-bytes' : 'uint64', 'pages-per-second' : 'uint64',
            'precopy-bytes' : 'uint64', 'downtime-bytes' : 'uint64',
-           'postcopy-bytes' : 'uint64' } }
+           'postcopy-bytes' : 'uint64', 'zero-copy-copied' : 'uint64' } }
 
 ##
 # @XBZRLECacheStats:
diff --git a/migration/migration.c b/migration/migration.c
index 78f5057373..68fc3894ba 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -1034,6 +1034,7 @@  static void populate_ram_info(MigrationInfo *info, MigrationState *s)
     info->ram->precopy_bytes = ram_counters.precopy_bytes;
     info->ram->downtime_bytes = ram_counters.downtime_bytes;
     info->ram->postcopy_bytes = ram_counters.postcopy_bytes;
+    info->ram->zero_copy_copied = ram_counters.zero_copy_copied;
 
     if (migrate_use_xbzrle()) {
         info->has_xbzrle_cache = true;
diff --git a/monitor/hmp-cmds.c b/monitor/hmp-cmds.c
index ca98df0495..bcb1799543 100644
--- a/monitor/hmp-cmds.c
+++ b/monitor/hmp-cmds.c
@@ -307,6 +307,10 @@  void hmp_info_migrate(Monitor *mon, const QDict *qdict)
             monitor_printf(mon, "postcopy ram: %" PRIu64 " kbytes\n",
                            info->ram->postcopy_bytes >> 10);
         }
+        if (info->ram->zero_copy_copied) {
+            monitor_printf(mon, "zero-copy copied: %" PRIu64 " iterations\n",
+                           info->ram->zero_copy_copied);
+        }
     }
 
     if (info->has_disk) {