diff mbox series

[v1,2/2] migration/multifd: Warn user when zerocopy not working

Message ID 20220628010908.390564-3-leobras@redhat.com
State New
Headers show
Series Zero copy improvements (QIOChannel + multifd) | expand

Commit Message

Leonardo Bras June 28, 2022, 1:09 a.m. UTC
Some errors, like the lack of Scatter-Gather support by the network
interface(NETIF_F_SG) may cause sendmsg(...,MSG_ZEROCOPY) to fail on using
zero-copy, which causes it to fall back to the default copying mechanism.

After each full dirty-bitmap scan there should be a zero-copy flush
happening, which checks for errors each of the previous calls to
sendmsg(...,MSG_ZEROCOPY). If all of them failed to use zero-copy, then
warn the user about it.

Since it happens once each full dirty-bitmap scan, even in worst case
scenario it should not print a lot of warnings, and will allow tracking
how many dirty-bitmap iterations were not able to use zero-copy send.

Signed-off-by: Leonardo Bras <leobras@redhat.com>
---
 migration/multifd.c | 3 +++
 1 file changed, 3 insertions(+)

Comments

Daniel P. Berrangé June 28, 2022, 7:53 a.m. UTC | #1
On Mon, Jun 27, 2022 at 10:09:09PM -0300, Leonardo Bras wrote:
> Some errors, like the lack of Scatter-Gather support by the network
> interface(NETIF_F_SG) may cause sendmsg(...,MSG_ZEROCOPY) to fail on using
> zero-copy, which causes it to fall back to the default copying mechanism.

How common is this lack of SG support ? What NICs did you have that
were affected ?

> After each full dirty-bitmap scan there should be a zero-copy flush
> happening, which checks for errors each of the previous calls to
> sendmsg(...,MSG_ZEROCOPY). If all of them failed to use zero-copy, then
> warn the user about it.
> 
> Since it happens once each full dirty-bitmap scan, even in worst case
> scenario it should not print a lot of warnings, and will allow tracking
> how many dirty-bitmap iterations were not able to use zero-copy send.

For long running migrations which are not converging, or converging
very slowly there could be 100's of passes.



> Signed-off-by: Leonardo Bras <leobras@redhat.com>
> ---
>  migration/multifd.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/migration/multifd.c b/migration/multifd.c
> index 684c014c86..9c62aec84e 100644
> --- a/migration/multifd.c
> +++ b/migration/multifd.c
> @@ -624,6 +624,9 @@ int multifd_send_sync_main(QEMUFile *f)
>              if (ret < 0) {
>                  error_report_err(err);
>                  return -1;
> +            } else if (ret == 1) {
> +                warn_report("The network device is not able to use "
> +                            "zero-copy-send: copying is being used");
>              }
>          }
>      }
> -- 
> 2.36.1
> 

With regards,
Daniel
Leonardo Bras June 28, 2022, 12:32 p.m. UTC | #2
On Tue, Jun 28, 2022 at 4:53 AM Daniel P. Berrangé <berrange@redhat.com> wrote:
>
> On Mon, Jun 27, 2022 at 10:09:09PM -0300, Leonardo Bras wrote:
> > Some errors, like the lack of Scatter-Gather support by the network
> > interface(NETIF_F_SG) may cause sendmsg(...,MSG_ZEROCOPY) to fail on using
> > zero-copy, which causes it to fall back to the default copying mechanism.
>
> How common is this lack of SG support ? What NICs did you have that
> were affected ?

I am not aware of any NIC without SG available for testing, nor have
any idea on how common they are.
But since we can detect sendmsg() falling back to copying we should
warn the user if this ever happens.

There is also a case in IPv6 related to fragmentation that may cause
MSG_ZEROCOPY to fall back to the copying mechanism, so it's also
covered.

>
> > After each full dirty-bitmap scan there should be a zero-copy flush
> > happening, which checks for errors each of the previous calls to
> > sendmsg(...,MSG_ZEROCOPY). If all of them failed to use zero-copy, then
> > warn the user about it.
> >
> > Since it happens once each full dirty-bitmap scan, even in worst case
> > scenario it should not print a lot of warnings, and will allow tracking
> > how many dirty-bitmap iterations were not able to use zero-copy send.
>
> For long running migrations which are not converging, or converging
> very slowly there could be 100's of passes.
>

I could change it so it only warns once, if that is too much output.

Best regards,
Leo

>
>
> > Signed-off-by: Leonardo Bras <leobras@redhat.com>
> > ---
> >  migration/multifd.c | 3 +++
> >  1 file changed, 3 insertions(+)
> >
> > diff --git a/migration/multifd.c b/migration/multifd.c
> > index 684c014c86..9c62aec84e 100644
> > --- a/migration/multifd.c
> > +++ b/migration/multifd.c
> > @@ -624,6 +624,9 @@ int multifd_send_sync_main(QEMUFile *f)
> >              if (ret < 0) {
> >                  error_report_err(err);
> >                  return -1;
> > +            } else if (ret == 1) {
> > +                warn_report("The network device is not able to use "
> > +                            "zero-copy-send: copying is being used");
> >              }
> >          }
> >      }
> > --
> > 2.36.1
> >
>
> With regards,
> Daniel
> --
> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
>
Daniel P. Berrangé June 28, 2022, 1:02 p.m. UTC | #3
On Tue, Jun 28, 2022 at 09:32:04AM -0300, Leonardo Bras Soares Passos wrote:
> On Tue, Jun 28, 2022 at 4:53 AM Daniel P. Berrangé <berrange@redhat.com> wrote:
> >
> > On Mon, Jun 27, 2022 at 10:09:09PM -0300, Leonardo Bras wrote:
> > > Some errors, like the lack of Scatter-Gather support by the network
> > > interface(NETIF_F_SG) may cause sendmsg(...,MSG_ZEROCOPY) to fail on using
> > > zero-copy, which causes it to fall back to the default copying mechanism.
> >
> > How common is this lack of SG support ? What NICs did you have that
> > were affected ?
> 
> I am not aware of any NIC without SG available for testing, nor have
> any idea on how common they are.
> But since we can detect sendmsg() falling back to copying we should
> warn the user if this ever happens.
> 
> There is also a case in IPv6 related to fragmentation that may cause
> MSG_ZEROCOPY to fall back to the copying mechanism, so it's also
> covered.
> 
> >
> > > After each full dirty-bitmap scan there should be a zero-copy flush
> > > happening, which checks for errors each of the previous calls to
> > > sendmsg(...,MSG_ZEROCOPY). If all of them failed to use zero-copy, then
> > > warn the user about it.
> > >
> > > Since it happens once each full dirty-bitmap scan, even in worst case
> > > scenario it should not print a lot of warnings, and will allow tracking
> > > how many dirty-bitmap iterations were not able to use zero-copy send.
> >
> > For long running migrations which are not converging, or converging
> > very slowly there could be 100's of passes.
> >
> 
> I could change it so it only warns once, if that is too much output.

Well I'm mostly wondering what we're expecting the user todo with this
information. Generally a log file containing warnings ends up turning
into a bug report. If we think it is important for users and/or mgmt
apps to be aware of this info, then it might be better to actually
put a field in the query-migrate stats to report if zero-copy is
being honoured or not, and just have a trace point in this location
instead.

With regards,
Daniel
Dr. David Alan Gilbert June 28, 2022, 1:52 p.m. UTC | #4
* Daniel P. Berrangé (berrange@redhat.com) wrote:
> On Tue, Jun 28, 2022 at 09:32:04AM -0300, Leonardo Bras Soares Passos wrote:
> > On Tue, Jun 28, 2022 at 4:53 AM Daniel P. Berrangé <berrange@redhat.com> wrote:
> > >
> > > On Mon, Jun 27, 2022 at 10:09:09PM -0300, Leonardo Bras wrote:
> > > > Some errors, like the lack of Scatter-Gather support by the network
> > > > interface(NETIF_F_SG) may cause sendmsg(...,MSG_ZEROCOPY) to fail on using
> > > > zero-copy, which causes it to fall back to the default copying mechanism.
> > >
> > > How common is this lack of SG support ? What NICs did you have that
> > > were affected ?
> > 
> > I am not aware of any NIC without SG available for testing, nor have
> > any idea on how common they are.
> > But since we can detect sendmsg() falling back to copying we should
> > warn the user if this ever happens.
> > 
> > There is also a case in IPv6 related to fragmentation that may cause
> > MSG_ZEROCOPY to fall back to the copying mechanism, so it's also
> > covered.
> > 
> > >
> > > > After each full dirty-bitmap scan there should be a zero-copy flush
> > > > happening, which checks for errors each of the previous calls to
> > > > sendmsg(...,MSG_ZEROCOPY). If all of them failed to use zero-copy, then
> > > > warn the user about it.
> > > >
> > > > Since it happens once each full dirty-bitmap scan, even in worst case
> > > > scenario it should not print a lot of warnings, and will allow tracking
> > > > how many dirty-bitmap iterations were not able to use zero-copy send.
> > >
> > > For long running migrations which are not converging, or converging
> > > very slowly there could be 100's of passes.
> > >
> > 
> > I could change it so it only warns once, if that is too much output.
> 
> Well I'm mostly wondering what we're expecting the user todo with this
> information. Generally a log file containing warnings ends up turning
> into a bug report. If we think it is important for users and/or mgmt
> apps to be aware of this info, then it might be better to actually
> put a field in the query-migrate stats to report if zero-copy is
> being honoured or not,

Yeh just a counter would work there I think.

> and just have a trace point in this location
> instead.

Yeh.

Dave

> With regards,
> Daniel
> -- 
> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
>
Leonardo Bras June 28, 2022, 4:53 p.m. UTC | #5
On Tue, Jun 28, 2022 at 10:52 AM Dr. David Alan Gilbert
<dgilbert@redhat.com> wrote:
>
> * Daniel P. Berrangé (berrange@redhat.com) wrote:
> > On Tue, Jun 28, 2022 at 09:32:04AM -0300, Leonardo Bras Soares Passos wrote:
> > > On Tue, Jun 28, 2022 at 4:53 AM Daniel P. Berrangé <berrange@redhat.com> wrote:
> > > >
> > > > On Mon, Jun 27, 2022 at 10:09:09PM -0300, Leonardo Bras wrote:
> > > > > Some errors, like the lack of Scatter-Gather support by the network
> > > > > interface(NETIF_F_SG) may cause sendmsg(...,MSG_ZEROCOPY) to fail on using
> > > > > zero-copy, which causes it to fall back to the default copying mechanism.
> > > >
> > > > How common is this lack of SG support ? What NICs did you have that
> > > > were affected ?
> > >
> > > I am not aware of any NIC without SG available for testing, nor have
> > > any idea on how common they are.
> > > But since we can detect sendmsg() falling back to copying we should
> > > warn the user if this ever happens.
> > >
> > > There is also a case in IPv6 related to fragmentation that may cause
> > > MSG_ZEROCOPY to fall back to the copying mechanism, so it's also
> > > covered.
> > >
> > > >
> > > > > After each full dirty-bitmap scan there should be a zero-copy flush
> > > > > happening, which checks for errors each of the previous calls to
> > > > > sendmsg(...,MSG_ZEROCOPY). If all of them failed to use zero-copy, then
> > > > > warn the user about it.
> > > > >
> > > > > Since it happens once each full dirty-bitmap scan, even in worst case
> > > > > scenario it should not print a lot of warnings, and will allow tracking
> > > > > how many dirty-bitmap iterations were not able to use zero-copy send.
> > > >
> > > > For long running migrations which are not converging, or converging
> > > > very slowly there could be 100's of passes.
> > > >
> > >
> > > I could change it so it only warns once, if that is too much output.
> >
> > Well I'm mostly wondering what we're expecting the user todo with this
> > information.


My rationale on that:
- zero-copy-send is a feature that is supposed to improve send
throughput by reducing cpu usage.
- there is a chance the sendmsg(MSG_ZEROCOPY) fails to use zero-copy
- if this happens, there will be a potential throughput decrease on sendmsg()
- the user (or management app) need to know when zero-copy-send is
degrading throughput, so it can be disabled
- this is also important for performance testing, given it can be
confusing having zero-copy-send improving throughput in some cases,
and degrading in others, without any apparent reason why.

> > Generally a log file containing warnings ends up turning
> > into a bug report. If we think it is important for users and/or mgmt
> > apps to be aware of this info, then it might be better to actually
> > put a field in the query-migrate stats to report if zero-copy is
> > being honoured or not,
>
> Yeh just a counter would work there I think.

The warning idea was totally due to my inexperience on this mgmt app
interface, since I had no other idea on how to deal with that.

I think having it in query-migrate is a much better idea than a
warning, since it should be much easier to parse and disable
zero-copy-send if desired.
Even in my current qemu test script, it's much better having it in
query-migrate.

>
> > and just have a trace point in this location
> > instead.
>
> Yeh.
>

Yeap, the counter idea seems great!
Will it be always printed there, or only when zero-copy-send is enabled?

Best regards,
Leo

> Dave
>
> > With regards,
> > Daniel
> > --
> > |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> > |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> > |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
> >
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>
Dr. David Alan Gilbert June 28, 2022, 4:56 p.m. UTC | #6
* Leonardo Bras Soares Passos (leobras@redhat.com) wrote:
> On Tue, Jun 28, 2022 at 10:52 AM Dr. David Alan Gilbert
> <dgilbert@redhat.com> wrote:
> >
> > * Daniel P. Berrangé (berrange@redhat.com) wrote:
> > > On Tue, Jun 28, 2022 at 09:32:04AM -0300, Leonardo Bras Soares Passos wrote:
> > > > On Tue, Jun 28, 2022 at 4:53 AM Daniel P. Berrangé <berrange@redhat.com> wrote:
> > > > >
> > > > > On Mon, Jun 27, 2022 at 10:09:09PM -0300, Leonardo Bras wrote:
> > > > > > Some errors, like the lack of Scatter-Gather support by the network
> > > > > > interface(NETIF_F_SG) may cause sendmsg(...,MSG_ZEROCOPY) to fail on using
> > > > > > zero-copy, which causes it to fall back to the default copying mechanism.
> > > > >
> > > > > How common is this lack of SG support ? What NICs did you have that
> > > > > were affected ?
> > > >
> > > > I am not aware of any NIC without SG available for testing, nor have
> > > > any idea on how common they are.
> > > > But since we can detect sendmsg() falling back to copying we should
> > > > warn the user if this ever happens.
> > > >
> > > > There is also a case in IPv6 related to fragmentation that may cause
> > > > MSG_ZEROCOPY to fall back to the copying mechanism, so it's also
> > > > covered.
> > > >
> > > > >
> > > > > > After each full dirty-bitmap scan there should be a zero-copy flush
> > > > > > happening, which checks for errors each of the previous calls to
> > > > > > sendmsg(...,MSG_ZEROCOPY). If all of them failed to use zero-copy, then
> > > > > > warn the user about it.
> > > > > >
> > > > > > Since it happens once each full dirty-bitmap scan, even in worst case
> > > > > > scenario it should not print a lot of warnings, and will allow tracking
> > > > > > how many dirty-bitmap iterations were not able to use zero-copy send.
> > > > >
> > > > > For long running migrations which are not converging, or converging
> > > > > very slowly there could be 100's of passes.
> > > > >
> > > >
> > > > I could change it so it only warns once, if that is too much output.
> > >
> > > Well I'm mostly wondering what we're expecting the user todo with this
> > > information.
> 
> 
> My rationale on that:
> - zero-copy-send is a feature that is supposed to improve send
> throughput by reducing cpu usage.
> - there is a chance the sendmsg(MSG_ZEROCOPY) fails to use zero-copy
> - if this happens, there will be a potential throughput decrease on sendmsg()
> - the user (or management app) need to know when zero-copy-send is
> degrading throughput, so it can be disabled
> - this is also important for performance testing, given it can be
> confusing having zero-copy-send improving throughput in some cases,
> and degrading in others, without any apparent reason why.
> 
> > > Generally a log file containing warnings ends up turning
> > > into a bug report. If we think it is important for users and/or mgmt
> > > apps to be aware of this info, then it might be better to actually
> > > put a field in the query-migrate stats to report if zero-copy is
> > > being honoured or not,
> >
> > Yeh just a counter would work there I think.
> 
> The warning idea was totally due to my inexperience on this mgmt app
> interface, since I had no other idea on how to deal with that.

Yeh it's not too silly an idea!
The way some of these warning or stats get to us can be a bit random,
but sometimes can confuse things.

> I think having it in query-migrate is a much better idea than a
> warning, since it should be much easier to parse and disable
> zero-copy-send if desired.
> Even in my current qemu test script, it's much better having it in
> query-migrate.
> 
> >
> > > and just have a trace point in this location
> > > instead.
> >
> > Yeh.
> >
> 
> Yeap, the counter idea seems great!
> Will it be always printed there, or only when zero-copy-send is enabled?

You could make it either if it's enabled or if it's none zero.
(I guess you want it to reset to 0 at the start of a new migration).

Dave

> 
> Best regards,
> Leo
> 
> > Dave
> >
> > > With regards,
> > > Daniel
> > > --
> > > |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> > > |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> > > |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
> > >
> > --
> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> >
>
Leonardo Bras July 1, 2022, 6:18 a.m. UTC | #7
On Tue, 2022-06-28 at 17:56 +0100, Dr. David Alan Gilbert wrote:
> * Leonardo Bras Soares Passos (leobras@redhat.com) wrote:
> > On Tue, Jun 28, 2022 at 10:52 AM Dr. David Alan Gilbert
> > <dgilbert@redhat.com> wrote:
> > > 
> > > * Daniel P. Berrangé (berrange@redhat.com) wrote:
> > > > On Tue, Jun 28, 2022 at 09:32:04AM -0300, Leonardo Bras Soares Passos
> > > > wrote:
> > > > > On Tue, Jun 28, 2022 at 4:53 AM Daniel P. Berrangé
> > > > > <berrange@redhat.com> wrote:
> > > > > > 
> > > > > > On Mon, Jun 27, 2022 at 10:09:09PM -0300, Leonardo Bras wrote:
> > > > > > > Some errors, like the lack of Scatter-Gather support by the
> > > > > > > network
> > > > > > > interface(NETIF_F_SG) may cause sendmsg(...,MSG_ZEROCOPY) to fail
> > > > > > > on using
> > > > > > > zero-copy, which causes it to fall back to the default copying
> > > > > > > mechanism.
> > > > > > 
> > > > > > How common is this lack of SG support ? What NICs did you have that
> > > > > > were affected ?
> > > > > 
> > > > > I am not aware of any NIC without SG available for testing, nor have
> > > > > any idea on how common they are.
> > > > > But since we can detect sendmsg() falling back to copying we should
> > > > > warn the user if this ever happens.
> > > > > 
> > > > > There is also a case in IPv6 related to fragmentation that may cause
> > > > > MSG_ZEROCOPY to fall back to the copying mechanism, so it's also
> > > > > covered.
> > > > > 
> > > > > > 
> > > > > > > After each full dirty-bitmap scan there should be a zero-copy
> > > > > > > flush
> > > > > > > happening, which checks for errors each of the previous calls to
> > > > > > > sendmsg(...,MSG_ZEROCOPY). If all of them failed to use zero-copy,
> > > > > > > then
> > > > > > > warn the user about it.
> > > > > > > 
> > > > > > > Since it happens once each full dirty-bitmap scan, even in worst
> > > > > > > case
> > > > > > > scenario it should not print a lot of warnings, and will allow
> > > > > > > tracking
> > > > > > > how many dirty-bitmap iterations were not able to use zero-copy
> > > > > > > send.
> > > > > > 
> > > > > > For long running migrations which are not converging, or converging
> > > > > > very slowly there could be 100's of passes.
> > > > > > 
> > > > > 
> > > > > I could change it so it only warns once, if that is too much output.
> > > > 
> > > > Well I'm mostly wondering what we're expecting the user todo with this
> > > > information.
> > 
> > 
> > My rationale on that:
> > - zero-copy-send is a feature that is supposed to improve send
> > throughput by reducing cpu usage.
> > - there is a chance the sendmsg(MSG_ZEROCOPY) fails to use zero-copy
> > - if this happens, there will be a potential throughput decrease on
> > sendmsg()
> > - the user (or management app) need to know when zero-copy-send is
> > degrading throughput, so it can be disabled
> > - this is also important for performance testing, given it can be
> > confusing having zero-copy-send improving throughput in some cases,
> > and degrading in others, without any apparent reason why.
> > 
> > > > Generally a log file containing warnings ends up turning
> > > > into a bug report. If we think it is important for users and/or mgmt
> > > > apps to be aware of this info, then it might be better to actually
> > > > put a field in the query-migrate stats to report if zero-copy is
> > > > being honoured or not,
> > > 
> > > Yeh just a counter would work there I think.
> > 
> > The warning idea was totally due to my inexperience on this mgmt app
> > interface, since I had no other idea on how to deal with that.
> 
> Yeh it's not too silly an idea!
> The way some of these warning or stats get to us can be a bit random,
> but sometimes can confuse things.
> 
> > I think having it in query-migrate is a much better idea than a
> > warning, since it should be much easier to parse and disable
> > zero-copy-send if desired.
> > Even in my current qemu test script, it's much better having it in
> > query-migrate.
> > 
> > > 
> > > > and just have a trace point in this location
> > > > instead.
> > > 
> > > Yeh.
> > > 
> > 
> > Yeap, the counter idea seems great!
> > Will it be always printed there, or only when zero-copy-send is enabled?
> 
> You could make it either if it's enabled or if it's none zero.
> (I guess you want it to reset to 0 at the start of a new migration).
> 
> Dave

Thanks for this feedback!

I have everything already working, but I am struggling with a good property
name. 

I am currently using zero_copy_copied (or zero-copy-copied in json), but it does
not look like a good Migration stat name. 

Do you have any suggestion?

Best regards,
Leo


> 
> > 
> > Best regards,
> > Leo
> > 
> > > Dave
> > > 
> > > > With regards,
> > > > Daniel
> > > > --
> > > > > : https://berrange.com      -o-   
> > > > > https://www.flickr.com/photos/dberrange :|
> > > > > : https://libvirt.org         -o-           
> > > > > https://fstop138.berrange.com :|
> > > > > : https://entangle-photo.org    -o-   
> > > > > https://www.instagram.com/dberrange :|
> > > > 
> > > --
> > > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> > > 
> >
Dr. David Alan Gilbert July 4, 2022, 9:19 a.m. UTC | #8
* Leonardo Brás (leobras@redhat.com) wrote:
> On Tue, 2022-06-28 at 17:56 +0100, Dr. David Alan Gilbert wrote:
> > * Leonardo Bras Soares Passos (leobras@redhat.com) wrote:
> > > On Tue, Jun 28, 2022 at 10:52 AM Dr. David Alan Gilbert
> > > <dgilbert@redhat.com> wrote:
> > > > 
> > > > * Daniel P. Berrangé (berrange@redhat.com) wrote:
> > > > > On Tue, Jun 28, 2022 at 09:32:04AM -0300, Leonardo Bras Soares Passos
> > > > > wrote:
> > > > > > On Tue, Jun 28, 2022 at 4:53 AM Daniel P. Berrangé
> > > > > > <berrange@redhat.com> wrote:
> > > > > > > 
> > > > > > > On Mon, Jun 27, 2022 at 10:09:09PM -0300, Leonardo Bras wrote:
> > > > > > > > Some errors, like the lack of Scatter-Gather support by the
> > > > > > > > network
> > > > > > > > interface(NETIF_F_SG) may cause sendmsg(...,MSG_ZEROCOPY) to fail
> > > > > > > > on using
> > > > > > > > zero-copy, which causes it to fall back to the default copying
> > > > > > > > mechanism.
> > > > > > > 
> > > > > > > How common is this lack of SG support ? What NICs did you have that
> > > > > > > were affected ?
> > > > > > 
> > > > > > I am not aware of any NIC without SG available for testing, nor have
> > > > > > any idea on how common they are.
> > > > > > But since we can detect sendmsg() falling back to copying we should
> > > > > > warn the user if this ever happens.
> > > > > > 
> > > > > > There is also a case in IPv6 related to fragmentation that may cause
> > > > > > MSG_ZEROCOPY to fall back to the copying mechanism, so it's also
> > > > > > covered.
> > > > > > 
> > > > > > > 
> > > > > > > > After each full dirty-bitmap scan there should be a zero-copy
> > > > > > > > flush
> > > > > > > > happening, which checks for errors each of the previous calls to
> > > > > > > > sendmsg(...,MSG_ZEROCOPY). If all of them failed to use zero-copy,
> > > > > > > > then
> > > > > > > > warn the user about it.
> > > > > > > > 
> > > > > > > > Since it happens once each full dirty-bitmap scan, even in worst
> > > > > > > > case
> > > > > > > > scenario it should not print a lot of warnings, and will allow
> > > > > > > > tracking
> > > > > > > > how many dirty-bitmap iterations were not able to use zero-copy
> > > > > > > > send.
> > > > > > > 
> > > > > > > For long running migrations which are not converging, or converging
> > > > > > > very slowly there could be 100's of passes.
> > > > > > > 
> > > > > > 
> > > > > > I could change it so it only warns once, if that is too much output.
> > > > > 
> > > > > Well I'm mostly wondering what we're expecting the user todo with this
> > > > > information.
> > > 
> > > 
> > > My rationale on that:
> > > - zero-copy-send is a feature that is supposed to improve send
> > > throughput by reducing cpu usage.
> > > - there is a chance the sendmsg(MSG_ZEROCOPY) fails to use zero-copy
> > > - if this happens, there will be a potential throughput decrease on
> > > sendmsg()
> > > - the user (or management app) need to know when zero-copy-send is
> > > degrading throughput, so it can be disabled
> > > - this is also important for performance testing, given it can be
> > > confusing having zero-copy-send improving throughput in some cases,
> > > and degrading in others, without any apparent reason why.
> > > 
> > > > > Generally a log file containing warnings ends up turning
> > > > > into a bug report. If we think it is important for users and/or mgmt
> > > > > apps to be aware of this info, then it might be better to actually
> > > > > put a field in the query-migrate stats to report if zero-copy is
> > > > > being honoured or not,
> > > > 
> > > > Yeh just a counter would work there I think.
> > > 
> > > The warning idea was totally due to my inexperience on this mgmt app
> > > interface, since I had no other idea on how to deal with that.
> > 
> > Yeh it's not too silly an idea!
> > The way some of these warning or stats get to us can be a bit random,
> > but sometimes can confuse things.
> > 
> > > I think having it in query-migrate is a much better idea than a
> > > warning, since it should be much easier to parse and disable
> > > zero-copy-send if desired.
> > > Even in my current qemu test script, it's much better having it in
> > > query-migrate.
> > > 
> > > > 
> > > > > and just have a trace point in this location
> > > > > instead.
> > > > 
> > > > Yeh.
> > > > 
> > > 
> > > Yeap, the counter idea seems great!
> > > Will it be always printed there, or only when zero-copy-send is enabled?
> > 
> > You could make it either if it's enabled or if it's none zero.
> > (I guess you want it to reset to 0 at the start of a new migration).
> > 
> > Dave
> 
> Thanks for this feedback!
> 
> I have everything already working, but I am struggling with a good property
> name. 
> 
> I am currently using zero_copy_copied (or zero-copy-copied in json), but it does
> not look like a good Migration stat name. 
> 
> Do you have any suggestion?

Shrug; I'm not going to fuss over the name too much as long as it's
reasonable. 'zero-copied' might be OK.

Dave

> Best regards,
> Leo
> 
> 
> > 
> > > 
> > > Best regards,
> > > Leo
> > > 
> > > > Dave
> > > > 
> > > > > With regards,
> > > > > Daniel
> > > > > --
> > > > > > : https://berrange.com      -o-   
> > > > > > https://www.flickr.com/photos/dberrange :|
> > > > > > : https://libvirt.org         -o-           
> > > > > > https://fstop138.berrange.com :|
> > > > > > : https://entangle-photo.org    -o-   
> > > > > > https://www.instagram.com/dberrange :|
> > > > > 
> > > > --
> > > > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> > > > 
> > > 
>
diff mbox series

Patch

diff --git a/migration/multifd.c b/migration/multifd.c
index 684c014c86..9c62aec84e 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -624,6 +624,9 @@  int multifd_send_sync_main(QEMUFile *f)
             if (ret < 0) {
                 error_report_err(err);
                 return -1;
+            } else if (ret == 1) {
+                warn_report("The network device is not able to use "
+                            "zero-copy-send: copying is being used");
             }
         }
     }