
[v1,3/3] migration: multifd: Enable zerocopy

Message ID 20210831110238.299458-4-leobras@redhat.com
State New
Series QIOChannel flags + multifd zerocopy

Commit Message

Leonardo Bras Aug. 31, 2021, 11:02 a.m. UTC
Call qio_channel_set_zerocopy(true) at the start of every multifd thread.

Change the send_write() interface of multifd, allowing it to pass down
flags for qio_channel_write*().

Pass down the MSG_ZEROCOPY flag for sending memory pages, while keeping the
other data sent with the default copying approach.

Signed-off-by: Leonardo Bras <leobras@redhat.com>
---
 migration/multifd-zlib.c | 7 ++++---
 migration/multifd-zstd.c | 7 ++++---
 migration/multifd.c      | 9 ++++++---
 migration/multifd.h      | 3 ++-
 4 files changed, 16 insertions(+), 10 deletions(-)
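
For reference, a rough sketch of the shape of this change (this is not the
actual diff; the flags-capable QIOChannel helper name below is an assumption
based on patch 2/3 of this series):

    /* Sketch only -- not the actual patch.  The ops->send_write() callback
     * gains a flags argument, and the no-compression implementation forwards
     * it to an assumed flags-aware QIOChannel write helper.
     * (context: migration/multifd.c) */
    #include "multifd.h"

    static int nocomp_send_write(MultiFDSendParams *p, uint32_t used,
                                 int flags, Error **errp)
    {
        return qio_channel_writev_all_flags(p->c, p->pages->iov, used,
                                            flags, errp);
    }

    /* In multifd_send_thread(): memory pages request MSG_ZEROCOPY, while the
     * packet header keeps using the plain copying write path. */
    ret = multifd_send_state->ops->send_write(p, used, MSG_ZEROCOPY, &local_err);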

Comments

Daniel P. Berrangé Aug. 31, 2021, 1:16 p.m. UTC | #1
On Tue, Aug 31, 2021 at 08:02:39AM -0300, Leonardo Bras wrote:
> Call qio_channel_set_zerocopy(true) in the start of every multifd thread.
> 
> Change the send_write() interface of multifd, allowing it to pass down
> flags for qio_channel_write*().
> 
> Pass down MSG_ZEROCOPY flag for sending memory pages, while keeping the
> other data being sent at the default copying approach.
> 
> Signed-off-by: Leonardo Bras <leobras@redhat.com>
> ---
>  migration/multifd-zlib.c | 7 ++++---
>  migration/multifd-zstd.c | 7 ++++---
>  migration/multifd.c      | 9 ++++++---
>  migration/multifd.h      | 3 ++-
>  4 files changed, 16 insertions(+), 10 deletions(-)

> @@ -675,7 +676,8 @@ static void *multifd_send_thread(void *opaque)
>              }
>  
>              if (used) {
> -                ret = multifd_send_state->ops->send_write(p, used, &local_err);
> +                ret = multifd_send_state->ops->send_write(p, used, MSG_ZEROCOPY,
> +                                                          &local_err);

I don't think it is valid to unconditionally enable this feature due to the
resource usage implications

https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html

  "A zerocopy failure will return -1 with errno ENOBUFS. This happens 
   if the socket option was not set, the socket exceeds its optmem 
   limit or the user exceeds its ulimit on locked pages."

The limit on locked pages is something that looks very likely to be
exceeded unless you happen to be running a QEMU config that already
implies locked memory (eg PCI assignment)
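
As an illustration of the concern (a standalone sketch, not something in this
series), QEMU would roughly need the locked-pages limit to cover whatever may
be pinned by in-flight zerocopy sends:

    /* Illustrative sketch: refuse MSG_ZEROCOPY if RLIMIT_MEMLOCK looks too
     * small.  Requiring the full guest RAM size is an assumption about the
     * worst case, not an established rule. */
    #include <stdbool.h>
    #include <stdint.h>
    #include <sys/resource.h>

    static bool zerocopy_memlock_limit_ok(uint64_t guest_ram_bytes)
    {
        struct rlimit lim;

        if (getrlimit(RLIMIT_MEMLOCK, &lim) != 0) {
            return false;
        }
        if (lim.rlim_cur == RLIM_INFINITY) {
            return true;
        }
        /* Worst case, every page queued with MSG_ZEROCOPY stays pinned until
         * the kernel reports completion on the error queue. */
        return lim.rlim_cur >= guest_ram_bytes;
    }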


Regards,
Daniel
Peter Xu Aug. 31, 2021, 8:29 p.m. UTC | #2
On Tue, Aug 31, 2021 at 02:16:42PM +0100, Daniel P. Berrangé wrote:
> On Tue, Aug 31, 2021 at 08:02:39AM -0300, Leonardo Bras wrote:
> > Call qio_channel_set_zerocopy(true) in the start of every multifd thread.
> > 
> > Change the send_write() interface of multifd, allowing it to pass down
> > flags for qio_channel_write*().
> > 
> > Pass down MSG_ZEROCOPY flag for sending memory pages, while keeping the
> > other data being sent at the default copying approach.
> > 
> > Signed-off-by: Leonardo Bras <leobras@redhat.com>
> > ---
> >  migration/multifd-zlib.c | 7 ++++---
> >  migration/multifd-zstd.c | 7 ++++---
> >  migration/multifd.c      | 9 ++++++---
> >  migration/multifd.h      | 3 ++-
> >  4 files changed, 16 insertions(+), 10 deletions(-)
> 
> > @@ -675,7 +676,8 @@ static void *multifd_send_thread(void *opaque)
> >              }
> >  
> >              if (used) {
> > -                ret = multifd_send_state->ops->send_write(p, used, &local_err);
> > +                ret = multifd_send_state->ops->send_write(p, used, MSG_ZEROCOPY,
> > +                                                          &local_err);
> 
> I don't think it is valid to unconditionally enable this feature due to the
> resource usage implications
> 
> https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html
> 
>   "A zerocopy failure will return -1 with errno ENOBUFS. This happens 
>    if the socket option was not set, the socket exceeds its optmem 
>    limit or the user exceeds its ulimit on locked pages."
> 
> The limit on locked pages is something that looks very likely to be
> exceeded unless you happen to be running a QEMU config that already
> implies locked memory (eg PCI assignment)

Yes, it would be great to have this as a migration capability in parallel to
multifd. In the initial phase, if it's easy to implement on multifd only, we can
add a dependency between the caps.  In the future we can remove that dependency
when the code is ready to go without multifd.  Thanks,
Daniel P. Berrangé Sept. 1, 2021, 8:53 a.m. UTC | #3
On Tue, Aug 31, 2021 at 04:29:09PM -0400, Peter Xu wrote:
> On Tue, Aug 31, 2021 at 02:16:42PM +0100, Daniel P. Berrangé wrote:
> > On Tue, Aug 31, 2021 at 08:02:39AM -0300, Leonardo Bras wrote:
> > > Call qio_channel_set_zerocopy(true) in the start of every multifd thread.
> > > 
> > > Change the send_write() interface of multifd, allowing it to pass down
> > > flags for qio_channel_write*().
> > > 
> > > Pass down MSG_ZEROCOPY flag for sending memory pages, while keeping the
> > > other data being sent at the default copying approach.
> > > 
> > > Signed-off-by: Leonardo Bras <leobras@redhat.com>
> > > ---
> > >  migration/multifd-zlib.c | 7 ++++---
> > >  migration/multifd-zstd.c | 7 ++++---
> > >  migration/multifd.c      | 9 ++++++---
> > >  migration/multifd.h      | 3 ++-
> > >  4 files changed, 16 insertions(+), 10 deletions(-)
> > 
> > > @@ -675,7 +676,8 @@ static void *multifd_send_thread(void *opaque)
> > >              }
> > >  
> > >              if (used) {
> > > -                ret = multifd_send_state->ops->send_write(p, used, &local_err);
> > > +                ret = multifd_send_state->ops->send_write(p, used, MSG_ZEROCOPY,
> > > +                                                          &local_err);
> > 
> > I don't think it is valid to unconditionally enable this feature due to the
> > resource usage implications
> > 
> > https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html
> > 
> >   "A zerocopy failure will return -1 with errno ENOBUFS. This happens 
> >    if the socket option was not set, the socket exceeds its optmem 
> >    limit or the user exceeds its ulimit on locked pages."
> > 
> > The limit on locked pages is something that looks very likely to be
> > exceeded unless you happen to be running a QEMU config that already
> > implies locked memory (eg PCI assignment)
> 
> Yes it would be great to be a migration capability in parallel to multifd. At
> initial phase if it's easy to be implemented on multi-fd only, we can add a
> dependency between the caps.  In the future we can remove that dependency when
> the code is ready to go without multifd.  Thanks,

Also, I'm wondering how zerocopy support interacts with kernel support
for kTLS and multipath-TCP, both of which we want to be able to use
with migration.


Regards,
Daniel
Peter Xu Sept. 1, 2021, 3:35 p.m. UTC | #4
On Wed, Sep 01, 2021 at 09:53:07AM +0100, Daniel P. Berrangé wrote:
> On Tue, Aug 31, 2021 at 04:29:09PM -0400, Peter Xu wrote:
> > On Tue, Aug 31, 2021 at 02:16:42PM +0100, Daniel P. Berrangé wrote:
> > > On Tue, Aug 31, 2021 at 08:02:39AM -0300, Leonardo Bras wrote:
> > > > Call qio_channel_set_zerocopy(true) in the start of every multifd thread.
> > > > 
> > > > Change the send_write() interface of multifd, allowing it to pass down
> > > > flags for qio_channel_write*().
> > > > 
> > > > Pass down MSG_ZEROCOPY flag for sending memory pages, while keeping the
> > > > other data being sent at the default copying approach.
> > > > 
> > > > Signed-off-by: Leonardo Bras <leobras@redhat.com>
> > > > ---
> > > >  migration/multifd-zlib.c | 7 ++++---
> > > >  migration/multifd-zstd.c | 7 ++++---
> > > >  migration/multifd.c      | 9 ++++++---
> > > >  migration/multifd.h      | 3 ++-
> > > >  4 files changed, 16 insertions(+), 10 deletions(-)
> > > 
> > > > @@ -675,7 +676,8 @@ static void *multifd_send_thread(void *opaque)
> > > >              }
> > > >  
> > > >              if (used) {
> > > > -                ret = multifd_send_state->ops->send_write(p, used, &local_err);
> > > > +                ret = multifd_send_state->ops->send_write(p, used, MSG_ZEROCOPY,
> > > > +                                                          &local_err);
> > > 
> > > I don't think it is valid to unconditionally enable this feature due to the
> > > resource usage implications
> > > 
> > > https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html
> > > 
> > >   "A zerocopy failure will return -1 with errno ENOBUFS. This happens 
> > >    if the socket option was not set, the socket exceeds its optmem 
> > >    limit or the user exceeds its ulimit on locked pages."
> > > 
> > > The limit on locked pages is something that looks very likely to be
> > > exceeded unless you happen to be running a QEMU config that already
> > > implies locked memory (eg PCI assignment)
> > 
> > Yes it would be great to be a migration capability in parallel to multifd. At
> > initial phase if it's easy to be implemented on multi-fd only, we can add a
> > dependency between the caps.  In the future we can remove that dependency when
> > the code is ready to go without multifd.  Thanks,
> 
> Also, I'm wondering how zerocopy support interacts with kernel support
> for kTLS and multipath-TCP, both of which we want to be able to use
> with migration.

Copying Jason Wang for net implications between these features on kernel side
and whether they can be enabled together (MSG_ZEROCOPY, mptcp, kTLS).

To be on the safe side we may want to only enable one of them until we prove
they'll work together, I guess..

Not an immediate concern, as I don't really think any of them is explicitly
supported in QEMU.

KTLS may be implicitly included by a new gnutls, but we need to mark TLS and
ZEROCOPY mutually exclusive anyway, because at least the userspace TLS code of
gnutls won't have a way to maintain the TLS buffers used by zerocopy.  So at
least we need some knob to detect whether kTLS is enabled in gnutls.
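
(For illustration only, the mutual exclusion could look something like the
check below; this is a sketch, and the argument names are hypothetical, not
existing QEMU identifiers.)

    /* Sketch only: reject a hypothetical zerocopy capability when TLS is in
     * use, since userspace gnutls owns the buffers and cannot keep them
     * stable until MSG_ZEROCOPY completion. */
    #include "qapi/error.h"

    static bool migrate_check_zerocopy_vs_tls(bool zerocopy_requested,
                                              bool tls_enabled, Error **errp)
    {
        if (zerocopy_requested && tls_enabled) {
            error_setg(errp, "zerocopy is not compatible with TLS migration");
            return false;
        }
        return true;
    }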
Daniel P. Berrangé Sept. 1, 2021, 3:44 p.m. UTC | #5
On Wed, Sep 01, 2021 at 11:35:33AM -0400, Peter Xu wrote:
> On Wed, Sep 01, 2021 at 09:53:07AM +0100, Daniel P. Berrangé wrote:
> > On Tue, Aug 31, 2021 at 04:29:09PM -0400, Peter Xu wrote:
> > > On Tue, Aug 31, 2021 at 02:16:42PM +0100, Daniel P. Berrangé wrote:
> > > > On Tue, Aug 31, 2021 at 08:02:39AM -0300, Leonardo Bras wrote:
> > > > > Call qio_channel_set_zerocopy(true) in the start of every multifd thread.
> > > > > 
> > > > > Change the send_write() interface of multifd, allowing it to pass down
> > > > > flags for qio_channel_write*().
> > > > > 
> > > > > Pass down MSG_ZEROCOPY flag for sending memory pages, while keeping the
> > > > > other data being sent at the default copying approach.
> > > > > 
> > > > > Signed-off-by: Leonardo Bras <leobras@redhat.com>
> > > > > ---
> > > > >  migration/multifd-zlib.c | 7 ++++---
> > > > >  migration/multifd-zstd.c | 7 ++++---
> > > > >  migration/multifd.c      | 9 ++++++---
> > > > >  migration/multifd.h      | 3 ++-
> > > > >  4 files changed, 16 insertions(+), 10 deletions(-)
> > > > 
> > > > > @@ -675,7 +676,8 @@ static void *multifd_send_thread(void *opaque)
> > > > >              }
> > > > >  
> > > > >              if (used) {
> > > > > -                ret = multifd_send_state->ops->send_write(p, used, &local_err);
> > > > > +                ret = multifd_send_state->ops->send_write(p, used, MSG_ZEROCOPY,
> > > > > +                                                          &local_err);
> > > > 
> > > > I don't think it is valid to unconditionally enable this feature due to the
> > > > resource usage implications
> > > > 
> > > > https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html
> > > > 
> > > >   "A zerocopy failure will return -1 with errno ENOBUFS. This happens 
> > > >    if the socket option was not set, the socket exceeds its optmem 
> > > >    limit or the user exceeds its ulimit on locked pages."
> > > > 
> > > > The limit on locked pages is something that looks very likely to be
> > > > exceeded unless you happen to be running a QEMU config that already
> > > > implies locked memory (eg PCI assignment)
> > > 
> > > Yes it would be great to be a migration capability in parallel to multifd. At
> > > initial phase if it's easy to be implemented on multi-fd only, we can add a
> > > dependency between the caps.  In the future we can remove that dependency when
> > > the code is ready to go without multifd.  Thanks,
> > 
> > Also, I'm wondering how zerocopy support interacts with kernel support
> > for kTLS and multipath-TCP, both of which we want to be able to use
> > with migration.
> 
> Copying Jason Wang for net implications between these features on kernel side
> and whether they can be enabled together (MSG_ZEROCOPY, mptcp, kTLS).
> 
> From the safe side we may want to only enable one of them until we prove
> they'll work together I guess..

MPTCP is good when we're network limited for migration

KTLS will be good when we're CPU limited on AES for migration,
which is essentially always when TLS is used.

ZEROCOPY will be good when we're CPU limited for data copy
on migration, or to reduce the impact on other concurrent
VMs on the same CPUs.

Ultimately we would benefit from all of them at the same
time, if it were technically possible to do.

> Not a immediate concern as I don't really think any of them is really
> explicitly supported in qemu.

QEMU has mptcp support already:

  commit 8bd1078aebcec5eac196a83ef1a7e74be0ba67b7
  Author: Dr. David Alan Gilbert <dgilbert@redhat.com>
  Date:   Wed Apr 21 12:28:34 2021 +0100

    sockets: Support multipath TCP
    
    Multipath TCP allows combining multiple interfaces/routes into a single
    socket, with very little work for the user/admin.
    
    It's enabled by 'mptcp' on most socket addresses:
    
       ./qemu-system-x86_64 -nographic -incoming tcp:0:4444,mptcp

> KTLS may be implicitly included by a new gnutls, but we need to mark TLS and
> ZEROCOPY mutual exclusive anyway because at least the userspace TLS code of
> gnutls won't has a way to maintain the tls buffers used by zerocopy.  So at
> least we need some knob to detect whether kTLS is enabled in gnutls.

It isn't possible for gnutls to transparently enable KTLS, because
GNUTLS doesn't get to see the actual socket directly - it'll need
some work in QEMU to enable it.  We know MPTCP and KTLS are currently
mutually exclusive as they both use the same kernel network hooks
framework.

Regards,
Daniel
Peter Xu Sept. 1, 2021, 4:01 p.m. UTC | #6
On Wed, Sep 01, 2021 at 04:44:30PM +0100, Daniel P. Berrangé wrote:
> QEMU has mptcp support already:
> 
>   commit 8bd1078aebcec5eac196a83ef1a7e74be0ba67b7
>   Author: Dr. David Alan Gilbert <dgilbert@redhat.com>
>   Date:   Wed Apr 21 12:28:34 2021 +0100
> 
>     sockets: Support multipath TCP
>     
>     Multipath TCP allows combining multiple interfaces/routes into a single
>     socket, with very little work for the user/admin.
>     
>     It's enabled by 'mptcp' on most socket addresses:
>     
>        ./qemu-system-x86_64 -nographic -incoming tcp:0:4444,mptcp

Oops, I totally forgot about that, sorry!

> 
> > KTLS may be implicitly included by a new gnutls, but we need to mark TLS and
> > ZEROCOPY mutual exclusive anyway because at least the userspace TLS code of
> > gnutls won't has a way to maintain the tls buffers used by zerocopy.  So at
> > least we need some knob to detect whether kTLS is enabled in gnutls.
> 
> It isn't possible for gnutls to transparently enable KTLS, because
> GNUTLS doesn't get to see the actual socket directly - it'll need
> some work in QEMU to enable it.  We know MPTCP and KTLS are currently
> mutually exclusive as they both use the same kernel network hooks
> framework.

Then we may need to at least figure out whether zerocopy needs to mask out
mptcp.
Leonardo Bras Sept. 2, 2021, 7:22 a.m. UTC | #7
Hello Daniel, thanks for the feedback !

On Tue, Aug 31, 2021 at 10:17 AM Daniel P. Berrangé <berrange@redhat.com> wrote:
>
> On Tue, Aug 31, 2021 at 08:02:39AM -0300, Leonardo Bras wrote:
> > Call qio_channel_set_zerocopy(true) in the start of every multifd thread.
> >
> > Change the send_write() interface of multifd, allowing it to pass down
> > flags for qio_channel_write*().
> >
> > Pass down MSG_ZEROCOPY flag for sending memory pages, while keeping the
> > other data being sent at the default copying approach.
> >
> > Signed-off-by: Leonardo Bras <leobras@redhat.com>
> > ---
> >  migration/multifd-zlib.c | 7 ++++---
> >  migration/multifd-zstd.c | 7 ++++---
> >  migration/multifd.c      | 9 ++++++---
> >  migration/multifd.h      | 3 ++-
> >  4 files changed, 16 insertions(+), 10 deletions(-)
>
> > @@ -675,7 +676,8 @@ static void *multifd_send_thread(void *opaque)
> >              }
> >
> >              if (used) {
> > -                ret = multifd_send_state->ops->send_write(p, used, &local_err);
> > +                ret = multifd_send_state->ops->send_write(p, used, MSG_ZEROCOPY,
> > +                                                          &local_err);
>
> I don't think it is valid to unconditionally enable this feature due to the
> resource usage implications
>
> https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html
>
>   "A zerocopy failure will return -1 with errno ENOBUFS. This happens
>    if the socket option was not set, the socket exceeds its optmem
>    limit or the user exceeds its ulimit on locked pages."

You are correct, I unfortunately missed this part in the docs :(

> The limit on locked pages is something that looks very likely to be
> exceeded unless you happen to be running a QEMU config that already
> implies locked memory (eg PCI assignment)

Do you mean the limit a user has on locking memory?

If so, that makes sense. I remember I needed to set the upper limit of locked
memory for the user before using it, or add a capability to QEMU first.

Maybe an option would be to try to mlock all guest memory before setting
zerocopy=on in the QEMU code. If it fails, we can print an error message and
fall back to not using zerocopy (following the idea of a new io_async_writev()
I mentioned in the previous mail).
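
Something along these lines is what I have in mind (just a sketch of the
fallback, not tested code):

    /* Sketch: try to lock the guest RAM region up front; if that fails
     * (e.g. RLIMIT_MEMLOCK too low), warn and keep the copying path. */
    #include <stdbool.h>
    #include <stdio.h>
    #include <sys/mman.h>

    static bool try_enable_zerocopy(void *guest_ram, size_t ram_size)
    {
        if (mlock(guest_ram, ram_size) != 0) {
            perror("mlock of guest RAM failed, falling back to copying sends");
            return false;
        }
        return true;  /* caller may now set SO_ZEROCOPY / pass MSG_ZEROCOPY */
    }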


>
>
> Regards,
> Daniel
> --
> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
>

Best regards,
Leonardo
Jason Wang Sept. 2, 2021, 7:23 a.m. UTC | #8
On 2021/9/1 11:35 PM, Peter Xu wrote:
> On Wed, Sep 01, 2021 at 09:53:07AM +0100, Daniel P. Berrangé wrote:
>> On Tue, Aug 31, 2021 at 04:29:09PM -0400, Peter Xu wrote:
>>> On Tue, Aug 31, 2021 at 02:16:42PM +0100, Daniel P. Berrangé wrote:
>>>> On Tue, Aug 31, 2021 at 08:02:39AM -0300, Leonardo Bras wrote:
>>>>> Call qio_channel_set_zerocopy(true) in the start of every multifd thread.
>>>>>
>>>>> Change the send_write() interface of multifd, allowing it to pass down
>>>>> flags for qio_channel_write*().
>>>>>
>>>>> Pass down MSG_ZEROCOPY flag for sending memory pages, while keeping the
>>>>> other data being sent at the default copying approach.
>>>>>
>>>>> Signed-off-by: Leonardo Bras <leobras@redhat.com>
>>>>> ---
>>>>>   migration/multifd-zlib.c | 7 ++++---
>>>>>   migration/multifd-zstd.c | 7 ++++---
>>>>>   migration/multifd.c      | 9 ++++++---
>>>>>   migration/multifd.h      | 3 ++-
>>>>>   4 files changed, 16 insertions(+), 10 deletions(-)
>>>>> @@ -675,7 +676,8 @@ static void *multifd_send_thread(void *opaque)
>>>>>               }
>>>>>   
>>>>>               if (used) {
>>>>> -                ret = multifd_send_state->ops->send_write(p, used, &local_err);
>>>>> +                ret = multifd_send_state->ops->send_write(p, used, MSG_ZEROCOPY,
>>>>> +                                                          &local_err);
>>>> I don't think it is valid to unconditionally enable this feature due to the
>>>> resource usage implications
>>>>
>>>> https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html
>>>>
>>>>    "A zerocopy failure will return -1 with errno ENOBUFS. This happens
>>>>     if the socket option was not set, the socket exceeds its optmem
>>>>     limit or the user exceeds its ulimit on locked pages."
>>>>
>>>> The limit on locked pages is something that looks very likely to be
>>>> exceeded unless you happen to be running a QEMU config that already
>>>> implies locked memory (eg PCI assignment)
>>> Yes it would be great to be a migration capability in parallel to multifd. At
>>> initial phase if it's easy to be implemented on multi-fd only, we can add a
>>> dependency between the caps.  In the future we can remove that dependency when
>>> the code is ready to go without multifd.  Thanks,
>> Also, I'm wondering how zerocopy support interacts with kernel support
>> for kTLS and multipath-TCP, both of which we want to be able to use
>> with migration.
> Copying Jason Wang for net implications between these features on kernel side


Note that MSG_ZEROCOPY was contributed by Google :)


> and whether they can be enabled together (MSG_ZEROCOPY, mptcp, kTLS).


I think they can. Anyway, the kernel can choose to do a data copy when necessary.

Note that "zerocopy" is probably not the correct term here. "Enable
MSG_ZEROCOPY" would be better, since:

1) the kernel supports various kinds of zerocopy; for TX, it has supported
sendfile() for many years.
2) MSG_ZEROCOPY is only used for TX, not RX.
3) TCP RX zerocopy is only supported via mmap(), which requires some
extra configuration, e.g. 4K MTU, driver support for header split, etc.

[1] https://www.youtube.com/watch?v=_ZfiQGWFvg0
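
For reference, the basic MSG_ZEROCOPY TX flow from the kernel documentation
linked earlier in the thread looks roughly like this (error handling and
notification parsing trimmed):

    #include <string.h>
    #include <sys/socket.h>

    static void zerocopy_send_example(int fd, const void *buf, size_t len)
    {
        int one = 1;
        char control[128];
        struct msghdr msg;

        /* 1) Opt in on the socket, otherwise MSG_ZEROCOPY sends fail. */
        setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one));

        /* 2) TX with MSG_ZEROCOPY; 'buf' must stay untouched until the
         *    kernel signals completion. */
        send(fd, buf, len, MSG_ZEROCOPY);

        /* 3) Completions arrive asynchronously on the socket error queue. */
        memset(&msg, 0, sizeof(msg));
        msg.msg_control = control;
        msg.msg_controllen = sizeof(control);
        recvmsg(fd, &msg, MSG_ERRQUEUE);
    }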

Thanks


>
>  From the safe side we may want to only enable one of them until we prove
> they'll work together I guess..
>
> Not a immediate concern as I don't really think any of them is really
> explicitly supported in qemu.
>
> KTLS may be implicitly included by a new gnutls, but we need to mark TLS and
> ZEROCOPY mutual exclusive anyway because at least the userspace TLS code of
> gnutls won't has a way to maintain the tls buffers used by zerocopy.  So at
> least we need some knob to detect whether kTLS is enabled in gnutls.
>
Leonardo Bras Sept. 2, 2021, 7:27 a.m. UTC | #9
Hello Peter, thank you for this feedback!

On Tue, Aug 31, 2021 at 5:29 PM Peter Xu <peterx@redhat.com> wrote:
> Yes it would be great to be a migration capability in parallel to multifd. At
> initial phase if it's easy to be implemented on multi-fd only, we can add a
> dependency between the caps.  In the future we can remove that dependency when
> the code is ready to go without multifd.  Thanks,

I thought the idea was to use MSG_ZEROCOPY whenever possible, and otherwise
fall back to the default copying mechanism.

In replies to 2/3 I mentioned a new method for the QIOChannel interface that
would fall back to copying, and that may be a way to avoid having to use a
capability.

Best regards,
Leonardo
Leonardo Bras Sept. 2, 2021, 7:57 a.m. UTC | #10
A few more comments on this one:

On Wed, Sep 1, 2021 at 12:44 PM Daniel P. Berrangé <berrange@redhat.com> wrote:
>
> > From the safe side we may want to only enable one of them until we prove
> > they'll work together I guess..
>
> MPTCP is good when we're network limited for migration
>
> KTLS will be good when we're CPU limited on AES for migration,
> which is essentially always when TLS is used.
>
> ZEROCOPY will be good when we're CPU limited for data copy
> on migration, or to reduce the impact on other concurrent
> VMs on the same CPUs.
>
> Ultimately we woudld benefit from all of them at the same
> time, if it were technically possible todo.

Maybe I misunderstood something, but IIRC KTLS kind of does some
'zerocopy'.

I mean, on a usual setup, we would have a buffer A containing the data to be
encrypted, a buffer B that would receive the encrypted data, and a buffer C,
in the kernel, that would receive a copy of buffer B.

With KTLS, buffer A would be passed to the kernel and be encrypted directly
into buffer C, avoiding one copy.
It's also possible that buffer A would be copied to a buffer B' inside the
kernel and then encrypted into buffer C, but I am not sure that would make
sense.

Anyway, other than the B' case, it would make no sense to have MSG_ZEROCOPY
in QIOChannel_TLS, so that's something I need to change in this patchset.

>
> > Not a immediate concern as I don't really think any of them is really
> > explicitly supported in qemu.
>
> QEMU has mptcp support already:
>
>   commit 8bd1078aebcec5eac196a83ef1a7e74be0ba67b7
>   Author: Dr. David Alan Gilbert <dgilbert@redhat.com>
>   Date:   Wed Apr 21 12:28:34 2021 +0100
>
>     sockets: Support multipath TCP
>
>     Multipath TCP allows combining multiple interfaces/routes into a single
>     socket, with very little work for the user/admin.
>
>     It's enabled by 'mptcp' on most socket addresses:
>
>        ./qemu-system-x86_64 -nographic -incoming tcp:0:4444,mptcp
>
> > KTLS may be implicitly included by a new gnutls, but we need to mark TLS and
> > ZEROCOPY mutual exclusive anyway because at least the userspace TLS code of
> > gnutls won't has a way to maintain the tls buffers used by zerocopy.  So at
> > least we need some knob to detect whether kTLS is enabled in gnutls.
>
> It isn't possible for gnutls to transparently enable KTLS, because
> GNUTLS doesn't get to see the actual socket directly

Yes, IIRC it uses gnuTLS with callbacks instead.

> - it'll need
> some work in QEMU to enable it.

Maybe it's overkill, but what if we implement KTLS directly in QEMU,
and fall back to gnuTLS when it's not available?
Leonardo Bras Sept. 2, 2021, 8:08 a.m. UTC | #11
Thanks for contributing Jason!

On Thu, Sep 2, 2021 at 4:23 AM Jason Wang <jasowang@redhat.com> wrote:
>
>
> On 2021/9/1 11:35 PM, Peter Xu wrote:
> > On Wed, Sep 01, 2021 at 09:53:07AM +0100, Daniel P. Berrangé wrote:
> >> On Tue, Aug 31, 2021 at 04:29:09PM -0400, Peter Xu wrote:
> >>> On Tue, Aug 31, 2021 at 02:16:42PM +0100, Daniel P. Berrangé wrote:
> >>>> On Tue, Aug 31, 2021 at 08:02:39AM -0300, Leonardo Bras wrote:
> >>>>> Call qio_channel_set_zerocopy(true) in the start of every multifd thread.
> >>>>>
> >>>>> Change the send_write() interface of multifd, allowing it to pass down
> >>>>> flags for qio_channel_write*().
> >>>>>
> >>>>> Pass down MSG_ZEROCOPY flag for sending memory pages, while keeping the
> >>>>> other data being sent at the default copying approach.
> >>>>>
> >>>>> Signed-off-by: Leonardo Bras <leobras@redhat.com>
> >>>>> ---
> >>>>>   migration/multifd-zlib.c | 7 ++++---
> >>>>>   migration/multifd-zstd.c | 7 ++++---
> >>>>>   migration/multifd.c      | 9 ++++++---
> >>>>>   migration/multifd.h      | 3 ++-
> >>>>>   4 files changed, 16 insertions(+), 10 deletions(-)
> >>>>> @@ -675,7 +676,8 @@ static void *multifd_send_thread(void *opaque)
> >>>>>               }
> >>>>>
> >>>>>               if (used) {
> >>>>> -                ret = multifd_send_state->ops->send_write(p, used, &local_err);
> >>>>> +                ret = multifd_send_state->ops->send_write(p, used, MSG_ZEROCOPY,
> >>>>> +                                                          &local_err);
> >>>> I don't think it is valid to unconditionally enable this feature due to the
> >>>> resource usage implications
> >>>>
> >>>> https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html
> >>>>
> >>>>    "A zerocopy failure will return -1 with errno ENOBUFS. This happens
> >>>>     if the socket option was not set, the socket exceeds its optmem
> >>>>     limit or the user exceeds its ulimit on locked pages."
> >>>>
> >>>> The limit on locked pages is something that looks very likely to be
> >>>> exceeded unless you happen to be running a QEMU config that already
> >>>> implies locked memory (eg PCI assignment)
> >>> Yes it would be great to be a migration capability in parallel to multifd. At
> >>> initial phase if it's easy to be implemented on multi-fd only, we can add a
> >>> dependency between the caps.  In the future we can remove that dependency when
> >>> the code is ready to go without multifd.  Thanks,
> >> Also, I'm wondering how zerocopy support interacts with kernel support
> >> for kTLS and multipath-TCP, both of which we want to be able to use
> >> with migration.
> > Copying Jason Wang for net implications between these features on kernel side
>
>
> Note that the MSG_ZEROCOPY is contributed by Google :)
>
>
> > and whether they can be enabled together (MSG_ZEROCOPY, mptcp, kTLS).
>
>
> I think they can. Anyway kernel can choose to do datacopy when necessary.
>
> Note that the "zerocopy" is probably not correct here. What's better is
> "Enable MSG_ZEROCOPY" since:
>
> 1) kernel supports various kinds of zerocopy, for TX, it has supported
> sendfile() for many years.
> 2) MSG_ZEROCOPY is only used for TX but not RX
> 3) TCP rx zerocopy is only supported via mmap() which requires some
> extra configurations e.g 4K MTU, driver support for header split etc.

RX would be my next challenge :)

>
> [1] https://www.youtube.com/watch?v=_ZfiQGWFvg0

Thank you for sharing!

Best regards,
Leonardo
Daniel P. Berrangé Sept. 2, 2021, 8:20 a.m. UTC | #12
On Thu, Sep 02, 2021 at 04:22:55AM -0300, Leonardo Bras Soares Passos wrote:
> Hello Daniel, thanks for the feedback !
> 
> On Tue, Aug 31, 2021 at 10:17 AM Daniel P. Berrangé <berrange@redhat.com> wrote:
> >
> > On Tue, Aug 31, 2021 at 08:02:39AM -0300, Leonardo Bras wrote:
> > > Call qio_channel_set_zerocopy(true) in the start of every multifd thread.
> > >
> > > Change the send_write() interface of multifd, allowing it to pass down
> > > flags for qio_channel_write*().
> > >
> > > Pass down MSG_ZEROCOPY flag for sending memory pages, while keeping the
> > > other data being sent at the default copying approach.
> > >
> > > Signed-off-by: Leonardo Bras <leobras@redhat.com>
> > > ---
> > >  migration/multifd-zlib.c | 7 ++++---
> > >  migration/multifd-zstd.c | 7 ++++---
> > >  migration/multifd.c      | 9 ++++++---
> > >  migration/multifd.h      | 3 ++-
> > >  4 files changed, 16 insertions(+), 10 deletions(-)
> >
> > > @@ -675,7 +676,8 @@ static void *multifd_send_thread(void *opaque)
> > >              }
> > >
> > >              if (used) {
> > > -                ret = multifd_send_state->ops->send_write(p, used, &local_err);
> > > +                ret = multifd_send_state->ops->send_write(p, used, MSG_ZEROCOPY,
> > > +                                                          &local_err);
> >
> > I don't think it is valid to unconditionally enable this feature due to the
> > resource usage implications
> >
> > https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html
> >
> >   "A zerocopy failure will return -1 with errno ENOBUFS. This happens
> >    if the socket option was not set, the socket exceeds its optmem
> >    limit or the user exceeds its ulimit on locked pages."
> 
> You are correct, I unfortunately missed this part in the docs :(
> 
> > The limit on locked pages is something that looks very likely to be
> > exceeded unless you happen to be running a QEMU config that already
> > implies locked memory (eg PCI assignment)
> 
> Do you mean the limit an user has on locking memory?

Yes, by default the limit QEMU sees will be something very small.

> If so, that makes sense. I remember I needed to set the upper limit of locked
> memory for the user before using it, or adding a capability to qemu before.
> 
> Maybe an option would be trying to mlock all guest memory before setting
> zerocopy=on in qemu code. If it fails, we can print an error message and fall
> back to not using zerocopy (following the idea of a new io_async_writev()
> I told you in the previous mail).

Currently the ability to lock memory is something that has to be configured
when QEMU starts, and it requires libvirt to grant suitable permissions
to QEMU. Memory locking is generally undesirable because it prevents
memory overcommit. Or rather, if you are allowing memory overcommit, then
allowing memory locking is a way to kill your entire host.

I don't think we can unconditionally grant the ability to lock arbitrary
guest RAM at startup, just to satisfy a possible desire to use zerocopy
migration later. Granting it at runtime feels questionable, as you now
need to track and predict how much locked memory you've allowed, and
also have possible problems with revocation.

Possibly you could unconditionally grant the ability to lock a small amount
of guest RAM at startup, but how small can it be while still making a
useful difference to migration? It would also imply we need to be very
careful with migration to avoid having so many outstanding
zerocopy requests that we exceed the limit.

IOW, the only clear place in which we can use zerocopy is where we are
already forced to accept the penalty of locked memory at startup, eg when
the guest is using huge pages and no overcommit, or possibly when the guest
is using PCI device assignment, though in the latter case I can't remember
whether we allow the entirety of guest RAM to be locked or not.

Overall, the memory locking needs look like a significant constraint that
will affect the ability to use this feature.

Regards,
Daniel
Leonardo Bras Sept. 2, 2021, 8:52 a.m. UTC | #13
On Thu, Sep 2, 2021 at 5:21 AM Daniel P. Berrangé <berrange@redhat.com> wrote:
>
> On Thu, Sep 02, 2021 at 04:22:55AM -0300, Leonardo Bras Soares Passos wrote:
> > Hello Daniel, thanks for the feedback !
> >
> > On Tue, Aug 31, 2021 at 10:17 AM Daniel P. Berrangé <berrange@redhat.com> wrote:
> > >
> > > On Tue, Aug 31, 2021 at 08:02:39AM -0300, Leonardo Bras wrote:
> > > > Call qio_channel_set_zerocopy(true) in the start of every multifd thread.
> > > >
> > > > Change the send_write() interface of multifd, allowing it to pass down
> > > > flags for qio_channel_write*().
> > > >
> > > > Pass down MSG_ZEROCOPY flag for sending memory pages, while keeping the
> > > > other data being sent at the default copying approach.
> > > >
> > > > Signed-off-by: Leonardo Bras <leobras@redhat.com>
> > > > ---
> > > >  migration/multifd-zlib.c | 7 ++++---
> > > >  migration/multifd-zstd.c | 7 ++++---
> > > >  migration/multifd.c      | 9 ++++++---
> > > >  migration/multifd.h      | 3 ++-
> > > >  4 files changed, 16 insertions(+), 10 deletions(-)
> > >
> > > > @@ -675,7 +676,8 @@ static void *multifd_send_thread(void *opaque)
> > > >              }
> > > >
> > > >              if (used) {
> > > > -                ret = multifd_send_state->ops->send_write(p, used, &local_err);
> > > > +                ret = multifd_send_state->ops->send_write(p, used, MSG_ZEROCOPY,
> > > > +                                                          &local_err);
> > >
> > > I don't think it is valid to unconditionally enable this feature due to the
> > > resource usage implications
> > >
> > > https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html
> > >
> > >   "A zerocopy failure will return -1 with errno ENOBUFS. This happens
> > >    if the socket option was not set, the socket exceeds its optmem
> > >    limit or the user exceeds its ulimit on locked pages."
> >
> > You are correct, I unfortunately missed this part in the docs :(
> >
> > > The limit on locked pages is something that looks very likely to be
> > > exceeded unless you happen to be running a QEMU config that already
> > > implies locked memory (eg PCI assignment)
> >
> > Do you mean the limit an user has on locking memory?
>
> Yes, by default limit QEMU sees will be something very small.
>
> > If so, that makes sense. I remember I needed to set the upper limit of locked
> > memory for the user before using it, or adding a capability to qemu before.
> >
> > Maybe an option would be trying to mlock all guest memory before setting
> > zerocopy=on in qemu code. If it fails, we can print an error message and fall
> > back to not using zerocopy (following the idea of a new io_async_writev()
> > I told you in the previous mail).
>
> Currently ability to lock memory is something that has to be configured
> when QEMU starts, and it requires libvirt to grant suitable permissions
> to QEMU. Memory locking is generally undesirable because it prevents
> memory overcommit. Or rather if you are allowing memory overcommit, then
> allowing memory locking is a way to kill your entire host.

You mean it's gonna consume too much memory, or something else?

>
> I don't think we can unconditionally grant ability to lock arbitrary
> guest RAM at startup, just to satisfy a possible desire to use zerocopy
> migration later. Granting it at runtime feels questionable as you now
> need to track and predict how much locked memory you've allowed, and
> also have possible problems with revokation.

(I am really new to this, so please forgive me if I am asking dumb or
overly basic questions)

What does revocation mean in this context?
You give the process the ability to lock n MB of memory, and then you take it away?
Why would that happen? Is locked memory a limited resource?

>
> Possibly you could unconditionally grant ability to lock a small amount
> of guest RAM at startup, but how small can it be, while still making a
> useful difference to migration. It would imply we also need to be very
> careful with migration to avoid having too large an amount of outstanding
> zerocopy requests to exceed the limit.

Yeah, having to decide on a value that would be OK to lock is very
complex, given we can migrate with multifd, which can make this value grow
a lot, unless we only allow a few of those fds to really use zerocopy.

>
> IOW, the only clear place in which we can use zerocopy, is where we are
> already forced to accept the penalty of locked memory at startup. eg when
> the guest is using huge pages and no overcommit, or possibly when the guest
> is using PCI device assignment,

It would be something already, given that those scenarios are the ones with
the largest number of pages to migrate. But I understand we could give it a try
on other scenarios.

> though in the latter I can't remember if
> we allow entire of guest RAM to be locked or not.

If I recall correctly from a previous discussion, this was the case at least
for PCI passthrough.

>
> Overall the memory locking needs look like a significant constraint that
> will affect ability to use this feature.
>

I agree, there is a lot to take into account.
In any case, those constraints could be checked in the same function as
the setsockopt(), right?
(Setting up configs to improve the chance of zerocopy would probably only
happen at/before QEMU startup, right?)

Best regards,
Leonardo
Daniel P. Berrangé Sept. 2, 2021, 9:20 a.m. UTC | #14
On Thu, Sep 02, 2021 at 05:52:15AM -0300, Leonardo Bras Soares Passos wrote:
> On Thu, Sep 2, 2021 at 5:21 AM Daniel P. Berrangé <berrange@redhat.com> wrote:
> >
> > On Thu, Sep 02, 2021 at 04:22:55AM -0300, Leonardo Bras Soares Passos wrote:
> > > Hello Daniel, thanks for the feedback !
> > >
> > > On Tue, Aug 31, 2021 at 10:17 AM Daniel P. Berrangé <berrange@redhat.com> wrote:
> > > >
> > > > On Tue, Aug 31, 2021 at 08:02:39AM -0300, Leonardo Bras wrote:
> > > > > Call qio_channel_set_zerocopy(true) in the start of every multifd thread.
> > > > >
> > > > > Change the send_write() interface of multifd, allowing it to pass down
> > > > > flags for qio_channel_write*().
> > > > >
> > > > > Pass down MSG_ZEROCOPY flag for sending memory pages, while keeping the
> > > > > other data being sent at the default copying approach.
> > > > >
> > > > > Signed-off-by: Leonardo Bras <leobras@redhat.com>
> > > > > ---
> > > > >  migration/multifd-zlib.c | 7 ++++---
> > > > >  migration/multifd-zstd.c | 7 ++++---
> > > > >  migration/multifd.c      | 9 ++++++---
> > > > >  migration/multifd.h      | 3 ++-
> > > > >  4 files changed, 16 insertions(+), 10 deletions(-)
> > > >
> > > > > @@ -675,7 +676,8 @@ static void *multifd_send_thread(void *opaque)
> > > > >              }
> > > > >
> > > > >              if (used) {
> > > > > -                ret = multifd_send_state->ops->send_write(p, used, &local_err);
> > > > > +                ret = multifd_send_state->ops->send_write(p, used, MSG_ZEROCOPY,
> > > > > +                                                          &local_err);
> > > >
> > > > I don't think it is valid to unconditionally enable this feature due to the
> > > > resource usage implications
> > > >
> > > > https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html
> > > >
> > > >   "A zerocopy failure will return -1 with errno ENOBUFS. This happens
> > > >    if the socket option was not set, the socket exceeds its optmem
> > > >    limit or the user exceeds its ulimit on locked pages."
> > >
> > > You are correct, I unfortunately missed this part in the docs :(
> > >
> > > > The limit on locked pages is something that looks very likely to be
> > > > exceeded unless you happen to be running a QEMU config that already
> > > > implies locked memory (eg PCI assignment)
> > >
> > > Do you mean the limit an user has on locking memory?
> >
> > Yes, by default limit QEMU sees will be something very small.
> >
> > > If so, that makes sense. I remember I needed to set the upper limit of locked
> > > memory for the user before using it, or adding a capability to qemu before.
> > >
> > > Maybe an option would be trying to mlock all guest memory before setting
> > > zerocopy=on in qemu code. If it fails, we can print an error message and fall
> > > back to not using zerocopy (following the idea of a new io_async_writev()
> > > I told you in the previous mail).
> >
> > Currently ability to lock memory is something that has to be configured
> > when QEMU starts, and it requires libvirt to grant suitable permissions
> > to QEMU. Memory locking is generally undesirable because it prevents
> > memory overcommit. Or rather if you are allowing memory overcommit, then
> > allowing memory locking is a way to kill your entire host.
> 
> You mean it's gonna consume too much memory, or something else?

Essentially yes. 

> > I don't think we can unconditionally grant ability to lock arbitrary
> > guest RAM at startup, just to satisfy a possible desire to use zerocopy
> > migration later. Granting it at runtime feels questionable as you now
> > need to track and predict how much locked memory you've allowed, and
> > also have possible problems with revokation.
> 
> (I am really new to this, so please forgive me if I am asking dumb or
> overly basic questions)
> 
> What does revokation means in this context?
> You give the process hability to lock n MB of memory, and then you take it?
> Why would that happen? Is Locked memory a limited resource?

Consider a VM host with 64 GB of RAM and 64 GB of swap, and an
overcommit ratio of 1.5. ie we'll run VMs with 64*1.5 GB of RAM
total.

So we can run 3 VMs each with 32 GB of RAM, giving 96 GB of usage
which exceeds physical RAM. Most of the time this may well be fine
as the VMs don't concurrently need their full RAM allocation, and
worst case they'll get pushed to swap as the kernel re-shares
memory in response to load. So perhaps each VM only needs 20 GB
resident at any time, but over time one VM can burst up to 32 GB
and then 12 GB of it gets swapped out later when inactive.

But now consider if we allowed 2 of the VMs to lock memory for
purposes of migration. Those 2 VMs can now pin 64 GB of memory
in the worst case, leaving no free memory for the 3rd VM or
for the OS. This will likely take down the entire host, regardless
of swap availability.

IOW, if you are overcomitting RAM you have to be extremely
careful about allowing any VM to lock memory. If you do decide
to allow memory locking, you need to make sure that the worst
case locked memory amount still leaves enough unlocked memory
for the OS to be able to effectively manage the overcommit
load via swap.  We definitely can't grant memory locking to
VMs at startup in this scenario, and if we grant it at runtime,
we need to be able to revoke it again later.

These overcommit numbers are a bit more extreme than you'd
usually see in the real world, but they illustrate the general
problem. Also bear in mind that QEMU has memory overhead
beyond the guest RAM block, which varies over time, making
accounting quite hard. We have to also assume that QEMU
could have been compromised by a guest breakout, so we
can't assume that migration will play nice - we have to
assume the worst case possible, given the process ulimits.


> > Overall the memory locking needs look like a significant constraint that
> > will affect ability to use this feature.
> >
> 
> I Agree, there is a lot to take in account.
> In any way, those constraints could be checked at the same function as
> the setsockopt() right?

QEMU could possibly check its ulimits to see if it is possible, but it
would be safer to require a migration capability to be explicitly set
by the mgmt app to opt in to zerocopy.
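
As a sketch of what that opt-in could look like on the multifd side (the
migrate_use_zerocopy() capability helper is hypothetical; it does not exist
today):

    /* Sketch only: request MSG_ZEROCOPY solely when a (hypothetical)
     * zerocopy migration capability has been enabled by the mgmt app. */
    static int multifd_send_flags(void)
    {
        return migrate_use_zerocopy() ? MSG_ZEROCOPY : 0;
    }

    /* ...and in multifd_send_thread(): */
    ret = multifd_send_state->ops->send_write(p, used, multifd_send_flags(),
                                              &local_err);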

> (Setting up configs to improve the chance of zerocopy would probably only
> happen at/before qemu starting, right?)

Usually, but you can change ulimits on the fly for a process. I'm just
not sure of the semantics if you reduce the limits while existing usage
exceeds the reduced value.

Regards,
Daniel
Leonardo Bras Sept. 2, 2021, 9:49 a.m. UTC | #15
On Thu, Sep 2, 2021 at 6:20 AM Daniel P. Berrangé <berrange@redhat.com> wrote:
>
> On Thu, Sep 02, 2021 at 05:52:15AM -0300, Leonardo Bras Soares Passos wrote:
> > On Thu, Sep 2, 2021 at 5:21 AM Daniel P. Berrangé <berrange@redhat.com> wrote:
> > >
> > > On Thu, Sep 02, 2021 at 04:22:55AM -0300, Leonardo Bras Soares Passos wrote:
> > > > Hello Daniel, thanks for the feedback !
> > > >
> > > > On Tue, Aug 31, 2021 at 10:17 AM Daniel P. Berrangé <berrange@redhat.com> wrote:
> > > > >
> > > > > On Tue, Aug 31, 2021 at 08:02:39AM -0300, Leonardo Bras wrote:
> > > > > > Call qio_channel_set_zerocopy(true) in the start of every multifd thread.
> > > > > >
> > > > > > Change the send_write() interface of multifd, allowing it to pass down
> > > > > > flags for qio_channel_write*().
> > > > > >
> > > > > > Pass down MSG_ZEROCOPY flag for sending memory pages, while keeping the
> > > > > > other data being sent at the default copying approach.
> > > > > >
> > > > > > Signed-off-by: Leonardo Bras <leobras@redhat.com>
> > > > > > ---
> > > > > >  migration/multifd-zlib.c | 7 ++++---
> > > > > >  migration/multifd-zstd.c | 7 ++++---
> > > > > >  migration/multifd.c      | 9 ++++++---
> > > > > >  migration/multifd.h      | 3 ++-
> > > > > >  4 files changed, 16 insertions(+), 10 deletions(-)
> > > > >
> > > > > > @@ -675,7 +676,8 @@ static void *multifd_send_thread(void *opaque)
> > > > > >              }
> > > > > >
> > > > > >              if (used) {
> > > > > > -                ret = multifd_send_state->ops->send_write(p, used, &local_err);
> > > > > > +                ret = multifd_send_state->ops->send_write(p, used, MSG_ZEROCOPY,
> > > > > > +                                                          &local_err);
> > > > >
> > > > > I don't think it is valid to unconditionally enable this feature due to the
> > > > > resource usage implications
> > > > >
> > > > > https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html
> > > > >
> > > > >   "A zerocopy failure will return -1 with errno ENOBUFS. This happens
> > > > >    if the socket option was not set, the socket exceeds its optmem
> > > > >    limit or the user exceeds its ulimit on locked pages."
> > > >
> > > > You are correct, I unfortunately missed this part in the docs :(
> > > >
> > > > > The limit on locked pages is something that looks very likely to be
> > > > > exceeded unless you happen to be running a QEMU config that already
> > > > > implies locked memory (eg PCI assignment)
> > > >
> > > > Do you mean the limit an user has on locking memory?
> > >
> > > Yes, by default limit QEMU sees will be something very small.
> > >
> > > > If so, that makes sense. I remember I needed to set the upper limit of locked
> > > > memory for the user before using it, or adding a capability to qemu before.
> > > >
> > > > Maybe an option would be trying to mlock all guest memory before setting
> > > > zerocopy=on in qemu code. If it fails, we can print an error message and fall
> > > > back to not using zerocopy (following the idea of a new io_async_writev()
> > > > I told you in the previous mail).
> > >
> > > Currently ability to lock memory is something that has to be configured
> > > when QEMU starts, and it requires libvirt to grant suitable permissions
> > > to QEMU. Memory locking is generally undesirable because it prevents
> > > memory overcommit. Or rather if you are allowing memory overcommit, then
> > > allowing memory locking is a way to kill your entire host.
> >
> > You mean it's gonna consume too much memory, or something else?
>
> Essentially yes.

Well, maybe we can check for available memory before doing that,
but maybe it's too much effort.

>
> > > I don't think we can unconditionally grant ability to lock arbitrary
> > > guest RAM at startup, just to satisfy a possible desire to use zerocopy
> > > migration later. Granting it at runtime feels questionable as you now
> > > need to track and predict how much locked memory you've allowed, and
> > > also have possible problems with revokation.
> >
> > (I am really new to this, so please forgive me if I am asking dumb or
> > overly basic questions)
> >
> > What does revokation means in this context?
> > You give the process hability to lock n MB of memory, and then you take it?
> > Why would that happen? Is Locked memory a limited resource?
>
> Consider a VM host with 64 GB of RAM and 64 GB of swap, and an
> overcommit ratio of 1.5. ie we'll run VMs with 64*1.5 GB of RAM
> total.
>
> So we can run 3 VMs each with 32 GB of RAM, giving 96 GB of usage
> which exceeds physical RAM. Most of the time this may well be fine
> as the VMs don't concurrently need their full RAM allocation, and
> worst case they'll get pushed to swap as the kernel re-shares
> memory in respose to load. So perhaps each VM only needs 20 GB
> resident at any time, but over time one VM can burst upto 32 GB
> and then 12 GB of it get swapped out later when inactive.
>
> But now consider if we allowed 2 of the VMs to lock memory for
> purposes of migration. Those 2 VMs can now pin 64 GB of memory
> in the worst case, leaving no free memory for the 3rd VM or
> for the OS. This will likely take down the entire host, regardless
> of swap availability.
>
> IOW, if you are overcomitting RAM you have to be extremely
> careful about allowing any VM to lock memory. If you do decide
> to allow memory locking, you need to make sure that the worst
> case locked memory amount still leaves enough unlocked memory
> for the OS to be able to effectively manage the overcommit
> load via swap.  We definitely can't grant memory locking to
> VMs at startup in this scenario, and if we grant it at runtime,
> we need to be able to revoke it again later.
>
> These overcommit numbers are a bit more extreme that you'd
> usually do in real world, but it illustrates the genreal
> problem. Also bear in mind that QEMU has memory overhead
> beyond the guest RAM block, which varies over time, making
> accounting quite hard. We have to also assume that QEMU
> could have been compromised by a guest breakout, so we
> can't assume that migration will play nice - we have to
> assume the worst case possible, given the process ulimits.
>

Yeah, that makes sense. Thanks for this illustration and elucidation !

I assume there is no way of asking the OS to lock memory such that, if there
is no space available, it fails and rolls back the locking.

If there were, VM2 would fail the locking and we could achieve
MSG_ZEROCOPY at least in VM1.

But thinking about it further, if both could lock the required memory and
little was available to VM3, it could experience a major slowdown, as would
the host OS. Yeah, that would be complex.

>
> > > Overall the memory locking needs look like a significant constraint that
> > > will affect ability to use this feature.
> > >
> >
> > I Agree, there is a lot to take in account.
> > In any way, those constraints could be checked at the same function as
> > the setsockopt() right?
>
> QEMU could possibly check its ulimits to see if it is possible, but it
> would be safer to require a migration capability to be explicitly set
> by the mgmt app to opt-in to zerocopy.

Yeah, that would make sense. Let the mgmt app check system resources
and choose where to request zerocopy.  It also has a way of knowing whether
that process has enough memory locked.

>
> > (Setting up configs to improve the chance of zerocopy would probably only
> > happen at/before qemu starting, right?)
>
> Usually, but you can change ulimits on the fly for a process. I'm just
> not sure of semantics if you reduce limits and existing usage exceeds
> the reduced value.

Yeah, that's a good question.

>
> Regards,
> Daniel
> --
> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
>

Best regards,
Leonardo Bras
Daniel P. Berrangé Sept. 2, 2021, 9:59 a.m. UTC | #16
On Thu, Sep 02, 2021 at 06:49:06AM -0300, Leonardo Bras Soares Passos wrote:
> On Thu, Sep 2, 2021 at 6:20 AM Daniel P. Berrangé <berrange@redhat.com> wrote:
> >
> > On Thu, Sep 02, 2021 at 05:52:15AM -0300, Leonardo Bras Soares Passos wrote:
> > > On Thu, Sep 2, 2021 at 5:21 AM Daniel P. Berrangé <berrange@redhat.com> wrote:
> > > >
> > > > On Thu, Sep 02, 2021 at 04:22:55AM -0300, Leonardo Bras Soares Passos wrote:
> > > > > Hello Daniel, thanks for the feedback !
> > > > >
> > > > > On Tue, Aug 31, 2021 at 10:17 AM Daniel P. Berrangé <berrange@redhat.com> wrote:
> > > > > >
> > > > > > On Tue, Aug 31, 2021 at 08:02:39AM -0300, Leonardo Bras wrote:
> > > > > > > Call qio_channel_set_zerocopy(true) in the start of every multifd thread.
> > > > > > >
> > > > > > > Change the send_write() interface of multifd, allowing it to pass down
> > > > > > > flags for qio_channel_write*().
> > > > > > >
> > > > > > > Pass down MSG_ZEROCOPY flag for sending memory pages, while keeping the
> > > > > > > other data being sent at the default copying approach.
> > > > > > >
> > > > > > > Signed-off-by: Leonardo Bras <leobras@redhat.com>
> > > > > > > ---
> > > > > > >  migration/multifd-zlib.c | 7 ++++---
> > > > > > >  migration/multifd-zstd.c | 7 ++++---
> > > > > > >  migration/multifd.c      | 9 ++++++---
> > > > > > >  migration/multifd.h      | 3 ++-
> > > > > > >  4 files changed, 16 insertions(+), 10 deletions(-)
> > > > > >
> > > > > > > @@ -675,7 +676,8 @@ static void *multifd_send_thread(void *opaque)
> > > > > > >              }
> > > > > > >
> > > > > > >              if (used) {
> > > > > > > -                ret = multifd_send_state->ops->send_write(p, used, &local_err);
> > > > > > > +                ret = multifd_send_state->ops->send_write(p, used, MSG_ZEROCOPY,
> > > > > > > +                                                          &local_err);
> > > > > >
> > > > > > I don't think it is valid to unconditionally enable this feature due to the
> > > > > > resource usage implications
> > > > > >
> > > > > > https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html
> > > > > >
> > > > > >   "A zerocopy failure will return -1 with errno ENOBUFS. This happens
> > > > > >    if the socket option was not set, the socket exceeds its optmem
> > > > > >    limit or the user exceeds its ulimit on locked pages."
> > > > >
> > > > > You are correct, I unfortunately missed this part in the docs :(
> > > > >
> > > > > > The limit on locked pages is something that looks very likely to be
> > > > > > exceeded unless you happen to be running a QEMU config that already
> > > > > > implies locked memory (eg PCI assignment)
> > > > >
> > > > > Do you mean the limit an user has on locking memory?
> > > >
> > > > Yes, by default limit QEMU sees will be something very small.
> > > >
> > > > > If so, that makes sense. I remember I needed to set the upper limit of locked
> > > > > memory for the user before using it, or adding a capability to qemu before.
> > > > >
> > > > > Maybe an option would be trying to mlock all guest memory before setting
> > > > > zerocopy=on in qemu code. If it fails, we can print an error message and fall
> > > > > back to not using zerocopy (following the idea of a new io_async_writev()
> > > > > I told you in the previous mail).
> > > >
> > > > Currently ability to lock memory is something that has to be configured
> > > > when QEMU starts, and it requires libvirt to grant suitable permissions
> > > > to QEMU. Memory locking is generally undesirable because it prevents
> > > > memory overcommit. Or rather if you are allowing memory overcommit, then
> > > > allowing memory locking is a way to kill your entire host.
> > >
> > > You mean it's gonna consume too much memory, or something else?
> >
> > Essentially yes.
> 
> Well, maybe we can check for available memory before doing that,
> but maybe it's too much effort.

From a mgmt app POV, we assume QEMU is untrustworthy, so the mgmt
app needs to enforce policy based on the worst case that the
VM configuration allows for.

Checking current available memory is not viable, because even
if you see 10 GB free, QEMU can't know if that free memory is
there to satisfy other VMs' worst-case needs, or its own. QEMU
needs to be explicitly told whether it's OK to use locked memory,
and an enforcement policy applied to it.


> > Consider a VM host with 64 GB of RAM and 64 GB of swap, and an
> > overcommit ratio of 1.5. ie we'll run VMs with 64*1.5 GB of RAM
> > total.
> >
> > So we can run 3 VMs each with 32 GB of RAM, giving 96 GB of usage
> > which exceeds physical RAM. Most of the time this may well be fine
> > as the VMs don't concurrently need their full RAM allocation, and
> > worst case they'll get pushed to swap as the kernel re-shares
> > memory in response to load. So perhaps each VM only needs 20 GB
> > resident at any time, but over time one VM can burst up to 32 GB
> > and then 12 GB of it get swapped out later when inactive.
> >
> > But now consider if we allowed 2 of the VMs to lock memory for
> > purposes of migration. Those 2 VMs can now pin 64 GB of memory
> > in the worst case, leaving no free memory for the 3rd VM or
> > for the OS. This will likely take down the entire host, regardless
> > of swap availability.
> >
> > IOW, if you are overcommitting RAM you have to be extremely
> > careful about allowing any VM to lock memory. If you do decide
> > to allow memory locking, you need to make sure that the worst
> > case locked memory amount still leaves enough unlocked memory
> > for the OS to be able to effectively manage the overcommit
> > load via swap.  We definitely can't grant memory locking to
> > VMs at startup in this scenario, and if we grant it at runtime,
> > we need to be able to revoke it again later.
> >
> > These overcommit numbers are a bit more extreme than you'd
> > usually do in the real world, but it illustrates the general
> > problem. Also bear in mind that QEMU has memory overhead
> > beyond the guest RAM block, which varies over time, making
> > accounting quite hard. We have to also assume that QEMU
> > could have been compromised by a guest breakout, so we
> > can't assume that migration will play nice - we have to
> > assume the worst case possible, given the process ulimits.
> >
> 
> Yeah, that makes sense. Thanks for this illustration and elucidation !
> 
> I assume there is no way of asking the OS to lock memory, and if there is
> no space available, it fails and rolls back the locking.

Yes & no. On most Linux configs though it ends up being no, because
instead of getting a nice failure, when host memory pressure occurs,
the OOM Killer wakes up and just culls processes.

Regards,
Daniel
Leonardo Bras Sept. 2, 2021, 10:25 a.m. UTC | #17
On Thu, Sep 2, 2021 at 6:59 AM Daniel P. Berrangé <berrange@redhat.com> wrote:
>
> On Thu, Sep 02, 2021 at 06:49:06AM -0300, Leonardo Bras Soares Passos wrote:
> > On Thu, Sep 2, 2021 at 6:20 AM Daniel P. Berrangé <berrange@redhat.com> wrote:
> > >
> > > On Thu, Sep 02, 2021 at 05:52:15AM -0300, Leonardo Bras Soares Passos wrote:
> > > > On Thu, Sep 2, 2021 at 5:21 AM Daniel P. Berrangé <berrange@redhat.com> wrote:
> > > > >
> > > > > On Thu, Sep 02, 2021 at 04:22:55AM -0300, Leonardo Bras Soares Passos wrote:
> > > > > > Hello Daniel, thanks for the feedback !
> > > > > >
> > > > > > On Tue, Aug 31, 2021 at 10:17 AM Daniel P. Berrangé <berrange@redhat.com> wrote:
> > > > > > >
> > > > > > > On Tue, Aug 31, 2021 at 08:02:39AM -0300, Leonardo Bras wrote:
> > > > > > > > Call qio_channel_set_zerocopy(true) in the start of every multifd thread.
> > > > > > > >
> > > > > > > > Change the send_write() interface of multifd, allowing it to pass down
> > > > > > > > flags for qio_channel_write*().
> > > > > > > >
> > > > > > > > Pass down MSG_ZEROCOPY flag for sending memory pages, while keeping the
> > > > > > > > other data being sent at the default copying approach.
> > > > > > > >
> > > > > > > > Signed-off-by: Leonardo Bras <leobras@redhat.com>
> > > > > > > > ---
> > > > > > > >  migration/multifd-zlib.c | 7 ++++---
> > > > > > > >  migration/multifd-zstd.c | 7 ++++---
> > > > > > > >  migration/multifd.c      | 9 ++++++---
> > > > > > > >  migration/multifd.h      | 3 ++-
> > > > > > > >  4 files changed, 16 insertions(+), 10 deletions(-)
> > > > > > >
> > > > > > > > @@ -675,7 +676,8 @@ static void *multifd_send_thread(void *opaque)
> > > > > > > >              }
> > > > > > > >
> > > > > > > >              if (used) {
> > > > > > > > -                ret = multifd_send_state->ops->send_write(p, used, &local_err);
> > > > > > > > +                ret = multifd_send_state->ops->send_write(p, used, MSG_ZEROCOPY,
> > > > > > > > +                                                          &local_err);
> > > > > > >
> > > > > > > I don't think it is valid to unconditionally enable this feature due to the
> > > > > > > resource usage implications
> > > > > > >
> > > > > > > https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html
> > > > > > >
> > > > > > >   "A zerocopy failure will return -1 with errno ENOBUFS. This happens
> > > > > > >    if the socket option was not set, the socket exceeds its optmem
> > > > > > >    limit or the user exceeds its ulimit on locked pages."
> > > > > >
> > > > > > You are correct, I unfortunately missed this part in the docs :(
> > > > > >
> > > > > > > The limit on locked pages is something that looks very likely to be
> > > > > > > exceeded unless you happen to be running a QEMU config that already
> > > > > > > implies locked memory (eg PCI assignment)
> > > > > >
> > > > > > Do you mean the limit an user has on locking memory?
> > > > >
> > > > > Yes, by default limit QEMU sees will be something very small.
> > > > >
> > > > > > If so, that makes sense. I remember I needed to set the upper limit of locked
> > > > > > memory for the user before using it, or adding a capability to qemu before.
> > > > > >
> > > > > > Maybe an option would be trying to mlock all guest memory before setting
> > > > > > zerocopy=on in qemu code. If it fails, we can print an error message and fall
> > > > > > back to not using zerocopy (following the idea of a new io_async_writev()
> > > > > > I told you in the previous mail).
> > > > >
> > > > > Currently ability to lock memory is something that has to be configured
> > > > > when QEMU starts, and it requires libvirt to grant suitable permissions
> > > > > to QEMU. Memory locking is generally undesirable because it prevents
> > > > > memory overcommit. Or rather if you are allowing memory overcommit, then
> > > > > allowing memory locking is a way to kill your entire host.
> > > >
> > > > You mean it's gonna consume too much memory, or something else?
> > >
> > > Essentially yes.
> >
> > Well, maybe we can check for available memory before doing that,
> > but maybe it's too much effort.
>
> From a mgmt app POV, we assume QEMU is untrustworthy, so the mgmt
> app needs to enforce policy based on the worst case that the
> VM configuration allows for.
>
> Checking current available memory is not viable, because even
> if you see 10 GB free, QEMU can't know if that free memory is
> there to satisfy other VMs' worst-case needs, or its own. QEMU
> needs to be explicitly told whether it's OK to use locked memory,
> and an enforcement policy applied to it.


Yeah, it makes sense to let the mgmt app deal with that, and enable/disable
MSG_ZEROCOPY on migration whenever it sees fit.
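
A minimal sketch of what that opt-in could look like on the sending side,
mirroring the multifd_send_thread() hunk in this patch - note that
migrate_use_zerocopy() is an assumed capability helper here, not an
existing QEMU API:

    /* Hypothetical: only request zerocopy when an opt-in migration
     * capability was set by the management application. */
    int write_flags = migrate_use_zerocopy() ? MSG_ZEROCOPY : 0;

    ret = multifd_send_state->ops->send_write(p, used, write_flags,
                                              &local_err);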


>
>
> > > Consider a VM host with 64 GB of RAM and 64 GB of swap, and an
> > > overcommit ratio of 1.5. ie we'll run VMs with 64*1.5 GB of RAM
> > > total.
> > >
> > > So we can run 3 VMs each with 32 GB of RAM, giving 96 GB of usage
> > > which exceeds physical RAM. Most of the time this may well be fine
> > > as the VMs don't concurrently need their full RAM allocation, and
> > > worst case they'll get pushed to swap as the kernel re-shares
> > > memory in response to load. So perhaps each VM only needs 20 GB
> > > resident at any time, but over time one VM can burst up to 32 GB
> > > and then 12 GB of it get swapped out later when inactive.
> > >
> > > But now consider if we allowed 2 of the VMs to lock memory for
> > > purposes of migration. Those 2 VMs can now pin 64 GB of memory
> > > in the worst case, leaving no free memory for the 3rd VM or
> > > for the OS. This will likely take down the entire host, regardless
> > > of swap availability.
> > >
> > > IOW, if you are overcommitting RAM you have to be extremely
> > > careful about allowing any VM to lock memory. If you do decide
> > > to allow memory locking, you need to make sure that the worst
> > > case locked memory amount still leaves enough unlocked memory
> > > for the OS to be able to effectively manage the overcommit
> > > load via swap.  We definitely can't grant memory locking to
> > > VMs at startup in this scenario, and if we grant it at runtime,
> > > we need to be able to revoke it again later.
> > >
> > > These overcommit numbers are a bit more extreme than you'd
> > > usually do in the real world, but it illustrates the general
> > > problem. Also bear in mind that QEMU has memory overhead
> > > beyond the guest RAM block, which varies over time, making
> > > accounting quite hard. We have to also assume that QEMU
> > > could have been compromised by a guest breakout, so we
> > > can't assume that migration will play nice - we have to
> > > assume the worst case possible, given the process ulimits.
> > >
> >
> > Yeah, that makes sense. Thanks for this illustration and elucidation !
> >
> > I assume there is no way of asking the OS to lock memory, and if there is
> > no space available, it fails and rolls back the locking.
>
> Yes & no. On most Linux configs though it ends up being no, because
> instead of getting a nice failure, when host memory pressure occurs,
> the OOM Killer wakes up and just culls processes.

oh, right the OOM Killer :)

>
> Regards,
> Daniel
> --
> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
>

Thanks!

Best regards,
Leonardo
Dr. David Alan Gilbert Sept. 7, 2021, 11:13 a.m. UTC | #18
* Daniel P. Berrangé (berrange@redhat.com) wrote:
> On Wed, Sep 01, 2021 at 11:35:33AM -0400, Peter Xu wrote:
> > On Wed, Sep 01, 2021 at 09:53:07AM +0100, Daniel P. Berrangé wrote:
> > > On Tue, Aug 31, 2021 at 04:29:09PM -0400, Peter Xu wrote:
> > > > On Tue, Aug 31, 2021 at 02:16:42PM +0100, Daniel P. Berrangé wrote:
> > > > > On Tue, Aug 31, 2021 at 08:02:39AM -0300, Leonardo Bras wrote:
> > > > > > Call qio_channel_set_zerocopy(true) in the start of every multifd thread.
> > > > > > 
> > > > > > Change the send_write() interface of multifd, allowing it to pass down
> > > > > > flags for qio_channel_write*().
> > > > > > 
> > > > > > Pass down MSG_ZEROCOPY flag for sending memory pages, while keeping the
> > > > > > other data being sent at the default copying approach.
> > > > > > 
> > > > > > Signed-off-by: Leonardo Bras <leobras@redhat.com>
> > > > > > ---
> > > > > >  migration/multifd-zlib.c | 7 ++++---
> > > > > >  migration/multifd-zstd.c | 7 ++++---
> > > > > >  migration/multifd.c      | 9 ++++++---
> > > > > >  migration/multifd.h      | 3 ++-
> > > > > >  4 files changed, 16 insertions(+), 10 deletions(-)
> > > > > 
> > > > > > @@ -675,7 +676,8 @@ static void *multifd_send_thread(void *opaque)
> > > > > >              }
> > > > > >  
> > > > > >              if (used) {
> > > > > > -                ret = multifd_send_state->ops->send_write(p, used, &local_err);
> > > > > > +                ret = multifd_send_state->ops->send_write(p, used, MSG_ZEROCOPY,
> > > > > > +                                                          &local_err);
> > > > > 
> > > > > I don't think it is valid to unconditionally enable this feature due to the
> > > > > resource usage implications
> > > > > 
> > > > > https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html
> > > > > 
> > > > >   "A zerocopy failure will return -1 with errno ENOBUFS. This happens 
> > > > >    if the socket option was not set, the socket exceeds its optmem 
> > > > >    limit or the user exceeds its ulimit on locked pages."
> > > > > 
> > > > > The limit on locked pages is something that looks very likely to be
> > > > > exceeded unless you happen to be running a QEMU config that already
> > > > > implies locked memory (eg PCI assignment)
> > > > 
> > > > Yes it would be great to be a migration capability in parallel to multifd. At
> > > > initial phase if it's easy to be implemented on multi-fd only, we can add a
> > > > dependency between the caps.  In the future we can remove that dependency when
> > > > the code is ready to go without multifd.  Thanks,
> > > 
> > > Also, I'm wondering how zerocopy support interacts with kernel support
> > > for kTLS and multipath-TCP, both of which we want to be able to use
> > > with migration.
> > 
> > Copying Jason Wang for net implications between these features on kernel side
> > and whether they can be enabled together (MSG_ZEROCOPY, mptcp, kTLS).
> > 
> > From the safe side we may want to only enable one of them until we prove
> > they'll work together I guess..
> 
> MPTCP is good when we're network limited for migration
> 
> KTLS will be good when we're CPU limited on AES for migration,
> which is essentially always when TLS is used.
> 
> ZEROCOPY will be good when we're CPU limited for data copy
> on migration, or to reduce the impact on other concurrent
> VMs on the same CPUs.
> 
> > Ultimately we would benefit from all of them at the same
> > time, if it were technically possible to do.

I think last time I spoke to Paolo Abeni there were some interactions
between them; I can't remember what though (I think mptcp and ktls
didn't play at the time).

Dave

> > Not an immediate concern as I don't really think any of them is really
> > explicitly supported in qemu.
> 
> QEMU has mptcp support already:
> 
>   commit 8bd1078aebcec5eac196a83ef1a7e74be0ba67b7
>   Author: Dr. David Alan Gilbert <dgilbert@redhat.com>
>   Date:   Wed Apr 21 12:28:34 2021 +0100
> 
>     sockets: Support multipath TCP
>     
>     Multipath TCP allows combining multiple interfaces/routes into a single
>     socket, with very little work for the user/admin.
>     
>     It's enabled by 'mptcp' on most socket addresses:
>     
>        ./qemu-system-x86_64 -nographic -incoming tcp:0:4444,mptcp
> 
> > KTLS may be implicitly included by a new gnutls, but we need to mark TLS and
> > ZEROCOPY mutual exclusive anyway because at least the userspace TLS code of
> > gnutls won't has a way to maintain the tls buffers used by zerocopy.  So at
> > least we need some knob to detect whether kTLS is enabled in gnutls.
> 
> It isn't possible for gnutls to transparently enable KTLS, because
> GNUTLS doesn't get to see the actual socket directly - it'll need
> some work in QEMU to enable it.  We know MPTCP and KTLS are currently
> mutually exclusive as they both use the same kernel network hooks
> framework.
> 
> Regards,
> Daniel
> -- 
> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
>
Dr. David Alan Gilbert Sept. 7, 2021, 11:17 a.m. UTC | #19
* Daniel P. Berrangé (berrange@redhat.com) wrote:
> On Thu, Sep 02, 2021 at 05:52:15AM -0300, Leonardo Bras Soares Passos wrote:
> > On Thu, Sep 2, 2021 at 5:21 AM Daniel P. Berrangé <berrange@redhat.com> wrote:
> > >
> > > On Thu, Sep 02, 2021 at 04:22:55AM -0300, Leonardo Bras Soares Passos wrote:
> > > > Hello Daniel, thanks for the feedback !
> > > >
> > > > On Tue, Aug 31, 2021 at 10:17 AM Daniel P. Berrangé <berrange@redhat.com> wrote:
> > > > >
> > > > > On Tue, Aug 31, 2021 at 08:02:39AM -0300, Leonardo Bras wrote:
> > > > > > Call qio_channel_set_zerocopy(true) in the start of every multifd thread.
> > > > > >
> > > > > > Change the send_write() interface of multifd, allowing it to pass down
> > > > > > flags for qio_channel_write*().
> > > > > >
> > > > > > Pass down MSG_ZEROCOPY flag for sending memory pages, while keeping the
> > > > > > other data being sent at the default copying approach.
> > > > > >
> > > > > > Signed-off-by: Leonardo Bras <leobras@redhat.com>
> > > > > > ---
> > > > > >  migration/multifd-zlib.c | 7 ++++---
> > > > > >  migration/multifd-zstd.c | 7 ++++---
> > > > > >  migration/multifd.c      | 9 ++++++---
> > > > > >  migration/multifd.h      | 3 ++-
> > > > > >  4 files changed, 16 insertions(+), 10 deletions(-)
> > > > >
> > > > > > @@ -675,7 +676,8 @@ static void *multifd_send_thread(void *opaque)
> > > > > >              }
> > > > > >
> > > > > >              if (used) {
> > > > > > -                ret = multifd_send_state->ops->send_write(p, used, &local_err);
> > > > > > +                ret = multifd_send_state->ops->send_write(p, used, MSG_ZEROCOPY,
> > > > > > +                                                          &local_err);
> > > > >
> > > > > I don't think it is valid to unconditionally enable this feature due to the
> > > > > resource usage implications
> > > > >
> > > > > https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html
> > > > >
> > > > >   "A zerocopy failure will return -1 with errno ENOBUFS. This happens
> > > > >    if the socket option was not set, the socket exceeds its optmem
> > > > >    limit or the user exceeds its ulimit on locked pages."
> > > >
> > > > You are correct, I unfortunately missed this part in the docs :(
> > > >
> > > > > The limit on locked pages is something that looks very likely to be
> > > > > exceeded unless you happen to be running a QEMU config that already
> > > > > implies locked memory (eg PCI assignment)
> > > >
> > > > Do you mean the limit an user has on locking memory?
> > >
> > > Yes, by default limit QEMU sees will be something very small.
> > >
> > > > If so, that makes sense. I remember I needed to set the upper limit of locked
> > > > memory for the user before using it, or adding a capability to qemu before.
> > > >
> > > > Maybe an option would be trying to mlock all guest memory before setting
> > > > zerocopy=on in qemu code. If it fails, we can print an error message and fall
> > > > back to not using zerocopy (following the idea of a new io_async_writev()
> > > > I told you in the previous mail).
> > >
> > > Currently ability to lock memory is something that has to be configured
> > > when QEMU starts, and it requires libvirt to grant suitable permissions
> > > to QEMU. Memory locking is generally undesirable because it prevents
> > > memory overcommit. Or rather if you are allowing memory overcommit, then
> > > allowing memory locking is a way to kill your entire host.
> > 
> > You mean it's gonna consume too much memory, or something else?
> 
> Essentially yes. 
> 
> > > I don't think we can unconditionally grant ability to lock arbitrary
> > > guest RAM at startup, just to satisfy a possible desire to use zerocopy
> > > migration later. Granting it at runtime feels questionable as you now
> > > need to track and predict how much locked memory you've allowed, and
> > > also have possible problems with revocation.
> > 
> > (I am really new to this, so please forgive me if I am asking dumb or
> > overly basic questions)
> > 
> > What does revocation mean in this context?
> > You give the process the ability to lock n MB of memory, and then you take it?
> > Why would that happen? Is Locked memory a limited resource?
> 
> Consider a VM host with 64 GB of RAM and 64 GB of swap, and an
> overcommit ratio of 1.5. ie we'll run VMs with 64*1.5 GB of RAM
> total.
> 
> So we can run 3 VMs each with 32 GB of RAM, giving 96 GB of usage
> which exceeds physical RAM. Most of the time this may well be fine
> as the VMs don't concurrently need their full RAM allocation, and
> worst case they'll get pushed to swap as the kernel re-shares
> memory in response to load. So perhaps each VM only needs 20 GB
> resident at any time, but over time one VM can burst up to 32 GB
> and then 12 GB of it get swapped out later when inactive.
> 
> But now consider if we allowed 2 of the VMs to lock memory for
> purposes of migration. Those 2 VMs can now pin 64 GB of memory
> in the worst case, leaving no free memory for the 3rd VM or
> for the OS. This will likely take down the entire host, regardless
> of swap availability.
> 
> IOW, if you are overcommitting RAM you have to be extremely
> careful about allowing any VM to lock memory. If you do decide
> to allow memory locking, you need to make sure that the worst
> case locked memory amount still leaves enough unlocked memory
> for the OS to be able to effectively manage the overcommit
> load via swap.  We definitely can't grant memory locking to
> VMs at startup in this scenario, and if we grant it at runtime,
> we need to be able to revoke it again later.
> 
> These overcommit numbers are a bit more extreme than you'd
> usually do in the real world, but it illustrates the general
> problem. Also bear in mind that QEMU has memory overhead
> beyond the guest RAM block, which varies over time, making
> accounting quite hard. We have to also assume that QEMU
> could have been compromised by a guest breakout, so we
> can't assume that migration will play nice - we have to
> assume the worst case possible, given the process ulimits.

We already have the same problem for RDMA; 
(Although it has some options for doing rolling locking in chunks
and recently there's code for use with new cards that don't need
locking).

I think the thing here is to offer zerocopy as an option; then people
can decide on the costs etc.
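
At the socket level the optional nature maps fairly directly onto the API:
the sender opts in with SO_ZEROCOPY, and can fall back to an ordinary
copying send if that - or a later MSG_ZEROCOPY send - fails. A rough
sketch of the generic pattern (not the QEMU code):

    #include <errno.h>
    #include <stdbool.h>
    #include <sys/socket.h>

    #ifndef SO_ZEROCOPY
    #define SO_ZEROCOPY 60          /* from <asm-generic/socket.h> */
    #endif
    #ifndef MSG_ZEROCOPY
    #define MSG_ZEROCOPY 0x4000000  /* from <linux/socket.h> */
    #endif

    /* Opt in on a connected TCP socket; callers fall back to plain
     * send() when this returns false (e.g. older kernels). */
    static bool try_enable_zerocopy(int fd)
    {
        int one = 1;

        return setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one)) == 0;
    }

    static ssize_t send_buf(int fd, const void *buf, size_t len, bool zc)
    {
        ssize_t ret = send(fd, buf, len, zc ? MSG_ZEROCOPY : 0);

        if (ret < 0 && errno == ENOBUFS && zc) {
            /* Locked-pages ulimit or optmem exceeded: retry as a copy. */
            ret = send(fd, buf, len, 0);
        }
        /* On success the buffer must stay untouched until the zerocopy
         * completion arrives on the error queue. */
        return ret;
    }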

Dave

> 
> > > Overall the memory locking needs look like a significant constraint that
> > > will affect ability to use this feature.
> > >
> > 
> > I Agree, there is a lot to take in account.
> > In any way, those constraints could be checked at the same function as
> > the setsockopt() right?
> 
> QEMU could possibly check its ulimits to see if it is possible, but it
> would be safer to require a migration capability to be explicitly set
> by the mgmt app to opt-in to zerocopy.
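
For reference, the ulimit check itself is cheap - something along these
lines (a small standalone sketch; picking the worst-case figure to compare
against is exactly the policy problem discussed above):

    #include <stdbool.h>
    #include <stdio.h>
    #include <sys/resource.h>

    /* Can this process lock at least `bytes` of memory?  RLIMIT_MEMLOCK
     * is the limit the MSG_ZEROCOPY accounting is checked against. */
    static bool memlock_limit_allows(unsigned long long bytes)
    {
        struct rlimit rl;

        if (getrlimit(RLIMIT_MEMLOCK, &rl) != 0) {
            return false;
        }
        return rl.rlim_cur == RLIM_INFINITY || rl.rlim_cur >= bytes;
    }

    int main(void)
    {
        /* Illustrative worst case: 4 GiB of guest pages in flight. */
        unsigned long long worst_case = 4ULL << 30;

        printf("zerocopy looks %s\n",
               memlock_limit_allows(worst_case) ? "possible" : "inadvisable");
        return 0;
    }
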
> 
> > (Setting up configs to improve the chance of zerocopy would probably only
> > happen at/before qemu starting, right?)
> 
> Usually, but you can change ulimits on the fly for a process. I'm just
> not sure of semantics if you reduce limits and existing usage exceeds
> the reduced value.
> 
> Regards,
> Daniel
> -- 
> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
>
Peter Xu Sept. 7, 2021, 6:32 p.m. UTC | #20
On Thu, Sep 02, 2021 at 04:22:55AM -0300, Leonardo Bras Soares Passos wrote:
> > I don't think it is valid to unconditionally enable this feature due to the
> > resource usage implications
> >
> > https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html
> >
> >   "A zerocopy failure will return -1 with errno ENOBUFS. This happens
> >    if the socket option was not set, the socket exceeds its optmem
> >    limit or the user exceeds its ulimit on locked pages."
> 
> You are correct, I unfortunately missed this part in the docs :(
> 
> > The limit on locked pages is something that looks very likely to be
> > exceeded unless you happen to be running a QEMU config that already
> > implies locked memory (eg PCI assignment)
> 
> Do you mean the limit an user has on locking memory?
> 
> If so, that makes sense. I remember I needed to set the upper limit of locked
> memory for the user before using it, or adding a capability to qemu before.

So I'm a bit confused on why MSG_ZEROCOPY requires checking RLIMIT_MEMLOCK.

The thing is IIUC that's accounting for pinned pages only with either mlock()
(FOLL_MLOCK) or vfio (FOLL_PIN).

I don't really think MSG_ZEROCOPY is doing that at all...  I'm looking at
__zerocopy_sg_from_iter() -> iov_iter_get_pages().

Say, I think there're plenty of iov_iter_get_pages() callers and normal GUPs, and I
think most of them do not need such accounting.
Jason Wang Sept. 8, 2021, 2:59 a.m. UTC | #21
On Wed, Sep 8, 2021 at 2:32 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Thu, Sep 02, 2021 at 04:22:55AM -0300, Leonardo Bras Soares Passos wrote:
> > > I don't think it is valid to unconditionally enable this feature due to the
> > > resource usage implications
> > >
> > > https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html
> > >
> > >   "A zerocopy failure will return -1 with errno ENOBUFS. This happens
> > >    if the socket option was not set, the socket exceeds its optmem
> > >    limit or the user exceeds its ulimit on locked pages."
> >
> > You are correct, I unfortunately missed this part in the docs :(
> >
> > > The limit on locked pages is something that looks very likely to be
> > > exceeded unless you happen to be running a QEMU config that already
> > > implies locked memory (eg PCI assignment)
> >
> > Do you mean the limit an user has on locking memory?
> >
> > If so, that makes sense. I remember I needed to set the upper limit of locked
> > memory for the user before using it, or adding a capability to qemu before.
>
> So I'm a bit confused on why MSG_ZEROCOPY requires checking RLIMIT_MEMLOCK.
>
> The thing is IIUC that's accounting for pinned pages only with either mlock()
> (FOLL_MLOCK) or vfio (FOLL_PIN).
>
> I don't really think MSG_ZEROCOPY is doing that at all...  I'm looking at
> __zerocopy_sg_from_iter() -> iov_iter_get_pages().

It happens probably here:

E.g

__ip_append_data()
    msg_zerocopy_realloc()
        mm_account_pinned_pages()
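
The check in there is essentially a per-user locked-pages budget taken
from RLIMIT_MEMLOCK (and skipped for CAP_IPC_LOCK) - roughly like this
simplified model, not the literal kernel code:

    #include <errno.h>
    #include <stddef.h>

    #define PAGE_SHIFT 12

    static unsigned long locked_pages;     /* models user->locked_vm    */
    static unsigned long memlock_limit_pg; /* RLIMIT_MEMLOCK, in pages  */

    /* Charge the pages backing an in-flight MSG_ZEROCOPY send. */
    static int account_zc_pages(size_t bytes)
    {
        unsigned long num_pg = (bytes >> PAGE_SHIFT) + 2;  /* worst case */

        if (locked_pages + num_pg > memlock_limit_pg) {
            return -ENOBUFS;      /* what userspace then sees as ENOBUFS */
        }
        locked_pages += num_pg;
        return 0;
    }

    /* Undone again when the completion notification fires. */
    static void unaccount_zc_pages(size_t bytes)
    {
        locked_pages -= (bytes >> PAGE_SHIFT) + 2;
    }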

Thanks

>
> Say, I think there're plenty of iov_iter_get_pages() callers and normal GUPs, and I
> think most of them do not need such accounting.
>
> --
> Peter Xu
>
Peter Xu Sept. 8, 2021, 3:24 a.m. UTC | #22
On Wed, Sep 08, 2021 at 10:59:57AM +0800, Jason Wang wrote:
> On Wed, Sep 8, 2021 at 2:32 AM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Thu, Sep 02, 2021 at 04:22:55AM -0300, Leonardo Bras Soares Passos wrote:
> > > > I don't think it is valid to unconditionally enable this feature due to the
> > > > resource usage implications
> > > >
> > > > https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html
> > > >
> > > >   "A zerocopy failure will return -1 with errno ENOBUFS. This happens
> > > >    if the socket option was not set, the socket exceeds its optmem
> > > >    limit or the user exceeds its ulimit on locked pages."
> > >
> > > You are correct, I unfortunately missed this part in the docs :(
> > >
> > > > The limit on locked pages is something that looks very likely to be
> > > > exceeded unless you happen to be running a QEMU config that already
> > > > implies locked memory (eg PCI assignment)
> > >
> > > Do you mean the limit an user has on locking memory?
> > >
> > > If so, that makes sense. I remember I needed to set the upper limit of locked
> > > memory for the user before using it, or adding a capability to qemu before.
> >
> > So I'm a bit confused on why MSG_ZEROCOPY requires checking RLIMIT_MEMLOCK.
> >
> > The thing is IIUC that's accounting for pinned pages only with either mlock()
> > (FOLL_MLOCK) or vfio (FOLL_PIN).
> >
> > I don't really think MSG_ZEROCOPY is doing that at all...  I'm looking at
> > __zerocopy_sg_from_iter() -> iov_iter_get_pages().
> 
> It happens probably here:
> 
> E.g
> 
> __ip_append_data()
>     msg_zerocopy_realloc()
>         mm_account_pinned_pages()

Right. :)

But my previous question is more about the reason behind it - I thought that's
a common GUP and it shouldn't rely on locked_vm, because it should still be a
temporary GUP rather than a long-term one; IMHO that's how we use locked_vm (we
don't account for temporary GUPs, only long-term ones). IOW, I'm wondering whether
all the rest of the iov_iter_get_pages() callers should check locked_vm too, and
AFAIU they're not doing that right now.

Thanks,
Jason Wang Sept. 8, 2021, 3:26 a.m. UTC | #23
On Wed, Sep 8, 2021 at 11:24 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Wed, Sep 08, 2021 at 10:59:57AM +0800, Jason Wang wrote:
> > On Wed, Sep 8, 2021 at 2:32 AM Peter Xu <peterx@redhat.com> wrote:
> > >
> > > On Thu, Sep 02, 2021 at 04:22:55AM -0300, Leonardo Bras Soares Passos wrote:
> > > > > I don't think it is valid to unconditionally enable this feature due to the
> > > > > resource usage implications
> > > > >
> > > > > https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html
> > > > >
> > > > >   "A zerocopy failure will return -1 with errno ENOBUFS. This happens
> > > > >    if the socket option was not set, the socket exceeds its optmem
> > > > >    limit or the user exceeds its ulimit on locked pages."
> > > >
> > > > You are correct, I unfortunately missed this part in the docs :(
> > > >
> > > > > The limit on locked pages is something that looks very likely to be
> > > > > exceeded unless you happen to be running a QEMU config that already
> > > > > implies locked memory (eg PCI assignment)
> > > >
> > > > Do you mean the limit an user has on locking memory?
> > > >
> > > > If so, that makes sense. I remember I needed to set the upper limit of locked
> > > > memory for the user before using it, or adding a capability to qemu before.
> > >
> > > So I'm a bit confused on why MSG_ZEROCOPY requires checking RLIMIT_MEMLOCK.
> > >
> > > The thing is IIUC that's accounting for pinned pages only with either mlock()
> > > (FOLL_MLOCK) or vfio (FOLL_PIN).
> > >
> > > I don't really think MSG_ZEROCOPY is doing that at all...  I'm looking at
> > > __zerocopy_sg_from_iter() -> iov_iter_get_pages().
> >
> > It happens probably here:
> >
> > E.g
> >
> > __ip_append_data()
> >     msg_zerocopy_realloc()
> >         mm_account_pinned_pages()
>
> Right. :)
>
> But my previous question is more about the reason behind it - I thought that's
> a common GUP and it shouldn't rely on locked_vm, because it should still be a
> temporary GUP rather than a long-term one; IMHO that's how we use locked_vm (we
> don't account for temporary GUPs, only long-term ones). IOW, I'm wondering whether
> all the rest of the iov_iter_get_pages() callers should check locked_vm too, and
> AFAIU they're not doing that right now.

You are probably right, but I'm not sure if it's too late to fix that
(e.g. it will break existing userspace).

Thanks

>
> Thanks,
>
> --
> Peter Xu
>
Dr. David Alan Gilbert Sept. 8, 2021, 8:19 a.m. UTC | #24
* Jason Wang (jasowang@redhat.com) wrote:
> On Wed, Sep 8, 2021 at 2:32 AM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Thu, Sep 02, 2021 at 04:22:55AM -0300, Leonardo Bras Soares Passos wrote:
> > > > I don't think it is valid to unconditionally enable this feature due to the
> > > > resource usage implications
> > > >
> > > > https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html
> > > >
> > > >   "A zerocopy failure will return -1 with errno ENOBUFS. This happens
> > > >    if the socket option was not set, the socket exceeds its optmem
> > > >    limit or the user exceeds its ulimit on locked pages."
> > >
> > > You are correct, I unfortunately missed this part in the docs :(
> > >
> > > > The limit on locked pages is something that looks very likely to be
> > > > exceeded unless you happen to be running a QEMU config that already
> > > > implies locked memory (eg PCI assignment)
> > >
> > > Do you mean the limit an user has on locking memory?
> > >
> > > If so, that makes sense. I remember I needed to set the upper limit of locked
> > > memory for the user before using it, or adding a capability to qemu before.
> >
> > So I'm a bit confused on why MSG_ZEROCOPY requires checking RLIMIT_MEMLOCK.
> >
> > The thing is IIUC that's accounting for pinned pages only with either mlock()
> > (FOLL_MLOCK) or vfio (FOLL_PIN).
> >
> > I don't really think MSG_ZEROCOPY is doing that at all...  I'm looking at
> > __zerocopy_sg_from_iter() -> iov_iter_get_pages().
> 
> It happens probably here:
> 
> E.g
> 
> __ip_append_data()
>     msg_zerocopy_realloc()
>         mm_account_pinned_pages()

When do they get uncounted?  i.e. is it just the data that's in flight
that's marked as pinned?

Dave

> Thanks
> 
> >
> > Say, I think there're plenty of iov_iter_get_pages() callers and normal GUPs, and I
> > think most of them do not need such accounting.
> >
> > --
> > Peter Xu
> >
>
Peter Xu Sept. 8, 2021, 3:19 p.m. UTC | #25
On Wed, Sep 08, 2021 at 09:19:27AM +0100, Dr. David Alan Gilbert wrote:
> * Jason Wang (jasowang@redhat.com) wrote:
> > On Wed, Sep 8, 2021 at 2:32 AM Peter Xu <peterx@redhat.com> wrote:
> > >
> > > On Thu, Sep 02, 2021 at 04:22:55AM -0300, Leonardo Bras Soares Passos wrote:
> > > > > I don't think it is valid to unconditionally enable this feature due to the
> > > > > resource usage implications
> > > > >
> > > > > https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html
> > > > >
> > > > >   "A zerocopy failure will return -1 with errno ENOBUFS. This happens
> > > > >    if the socket option was not set, the socket exceeds its optmem
> > > > >    limit or the user exceeds its ulimit on locked pages."
> > > >
> > > > You are correct, I unfortunately missed this part in the docs :(
> > > >
> > > > > The limit on locked pages is something that looks very likely to be
> > > > > exceeded unless you happen to be running a QEMU config that already
> > > > > implies locked memory (eg PCI assignment)
> > > >
> > > > Do you mean the limit an user has on locking memory?
> > > >
> > > > If so, that makes sense. I remember I needed to set the upper limit of locked
> > > > memory for the user before using it, or adding a capability to qemu before.
> > >
> > > So I'm a bit confused on why MSG_ZEROCOPY requires checking RLIMIT_MEMLOCK.
> > >
> > > The thing is IIUC that's accounting for pinned pages only with either mlock()
> > > (FOLL_MLOCK) or vfio (FOLL_PIN).
> > >
> > > I don't really think MSG_ZEROCOPY is doing that at all...  I'm looking at
> > > __zerocopy_sg_from_iter() -> iov_iter_get_pages().
> > 
> > It happens probably here:
> > 
> > E.g
> > 
> > __ip_append_data()
> >     msg_zerocopy_realloc()
> >         mm_account_pinned_pages()
> 
> When do they get uncounted?  i.e. is it just the data that's in flight
> that's marked as pinned?

I think so - there's __msg_zerocopy_callback() -> mm_unaccount_pinned_pages()
correspondingly.  Thanks,
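
On the userspace side that corresponds to the completion notification
turning up on the socket's error queue; draining it looks roughly like
this (a sketch following the kernel's msg_zerocopy documentation, IPv4
case only):

    #include <linux/errqueue.h>
    #include <netinet/in.h>
    #include <stdio.h>
    #include <sys/socket.h>

    /* Read one zerocopy completion; the pages backing sends numbered
     * ee_info..ee_data are no longer pinned once this arrives. */
    static int read_zerocopy_completion(int fd)
    {
        union {
            char buf[CMSG_SPACE(sizeof(struct sock_extended_err) +
                                sizeof(struct sockaddr_in))];
            struct cmsghdr align;
        } control;
        struct msghdr msg = { .msg_control = &control,
                              .msg_controllen = sizeof(control) };
        struct cmsghdr *cm;

        if (recvmsg(fd, &msg, MSG_ERRQUEUE) < 0) {
            return -1;              /* nothing pending (EAGAIN) or error */
        }

        for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
            struct sock_extended_err *serr;

            if (cm->cmsg_level != SOL_IP || cm->cmsg_type != IP_RECVERR) {
                continue;           /* IPv6 uses SOL_IPV6/IPV6_RECVERR */
            }
            serr = (struct sock_extended_err *)CMSG_DATA(cm);
            if (serr->ee_origin == SO_EE_ORIGIN_ZEROCOPY) {
                printf("sends %u..%u completed, copied=%d\n",
                       serr->ee_info, serr->ee_data,
                       !!(serr->ee_code & SO_EE_CODE_ZEROCOPY_COPIED));
            }
        }
        return 0;
    }
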
Daniel P. Berrangé Sept. 8, 2021, 3:26 p.m. UTC | #26
On Tue, Sep 07, 2021 at 12:13:28PM +0100, Dr. David Alan Gilbert wrote:
> * Daniel P. Berrangé (berrange@redhat.com) wrote:
> > On Wed, Sep 01, 2021 at 11:35:33AM -0400, Peter Xu wrote:
> > > On Wed, Sep 01, 2021 at 09:53:07AM +0100, Daniel P. Berrangé wrote:
> > > > On Tue, Aug 31, 2021 at 04:29:09PM -0400, Peter Xu wrote:
> > > > > On Tue, Aug 31, 2021 at 02:16:42PM +0100, Daniel P. Berrangé wrote:
> > > > > > On Tue, Aug 31, 2021 at 08:02:39AM -0300, Leonardo Bras wrote:
> > > > > > > Call qio_channel_set_zerocopy(true) in the start of every multifd thread.
> > > > > > > 
> > > > > > > Change the send_write() interface of multifd, allowing it to pass down
> > > > > > > flags for qio_channel_write*().
> > > > > > > 
> > > > > > > Pass down MSG_ZEROCOPY flag for sending memory pages, while keeping the
> > > > > > > other data being sent at the default copying approach.
> > > > > > > 
> > > > > > > Signed-off-by: Leonardo Bras <leobras@redhat.com>
> > > > > > > ---
> > > > > > >  migration/multifd-zlib.c | 7 ++++---
> > > > > > >  migration/multifd-zstd.c | 7 ++++---
> > > > > > >  migration/multifd.c      | 9 ++++++---
> > > > > > >  migration/multifd.h      | 3 ++-
> > > > > > >  4 files changed, 16 insertions(+), 10 deletions(-)
> > > > > > 
> > > > > > > @@ -675,7 +676,8 @@ static void *multifd_send_thread(void *opaque)
> > > > > > >              }
> > > > > > >  
> > > > > > >              if (used) {
> > > > > > > -                ret = multifd_send_state->ops->send_write(p, used, &local_err);
> > > > > > > +                ret = multifd_send_state->ops->send_write(p, used, MSG_ZEROCOPY,
> > > > > > > +                                                          &local_err);
> > > > > > 
> > > > > > I don't think it is valid to unconditionally enable this feature due to the
> > > > > > resource usage implications
> > > > > > 
> > > > > > https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html
> > > > > > 
> > > > > >   "A zerocopy failure will return -1 with errno ENOBUFS. This happens 
> > > > > >    if the socket option was not set, the socket exceeds its optmem 
> > > > > >    limit or the user exceeds its ulimit on locked pages."
> > > > > > 
> > > > > > The limit on locked pages is something that looks very likely to be
> > > > > > exceeded unless you happen to be running a QEMU config that already
> > > > > > implies locked memory (eg PCI assignment)
> > > > > 
> > > > > Yes it would be great to be a migration capability in parallel to multifd. At
> > > > > initial phase if it's easy to be implemented on multi-fd only, we can add a
> > > > > dependency between the caps.  In the future we can remove that dependency when
> > > > > the code is ready to go without multifd.  Thanks,
> > > > 
> > > > Also, I'm wondering how zerocopy support interacts with kernel support
> > > > for kTLS and multipath-TCP, both of which we want to be able to use
> > > > with migration.
> > > 
> > > Copying Jason Wang for net implications between these features on kernel side
> > > and whether they can be enabled together (MSG_ZEROCOPY, mptcp, kTLS).
> > > 
> > > From the safe side we may want to only enable one of them until we prove
> > > they'll work together I guess..
> > 
> > MPTCP is good when we're network limited for migration
> > 
> > KTLS will be good when we're CPU limited on AES for migration,
> > which is essentially always when TLS is used.
> > 
> > ZEROCOPY will be good when we're CPU limited for data copy
> > on migration, or to reduce the impact on other concurrent
> > VMs on the same CPUs.
> > 
> > Ultimately we would benefit from all of them at the same
> > time, if it were technically possible to do.
> 
> I think last time I spoke to Paolo Abeni there were some interactions
> between them; I can't remember what though (I think mptcp and ktls
> didn't play at the time).

MPTCP and KTLS use the same kernel hook in the network layer and
only 1 hook is permitted at a time, making them mutually exclusive
for now. In theory this can be fixed but no one is actively working
on it yet.


Regards,
Daniel
Jason Wang Sept. 9, 2021, 1:10 a.m. UTC | #27
On Wed, Sep 8, 2021 at 11:19 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Wed, Sep 08, 2021 at 09:19:27AM +0100, Dr. David Alan Gilbert wrote:
> > * Jason Wang (jasowang@redhat.com) wrote:
> > > On Wed, Sep 8, 2021 at 2:32 AM Peter Xu <peterx@redhat.com> wrote:
> > > >
> > > > On Thu, Sep 02, 2021 at 04:22:55AM -0300, Leonardo Bras Soares Passos wrote:
> > > > > > I don't think it is valid to unconditionally enable this feature due to the
> > > > > > resource usage implications
> > > > > >
> > > > > > https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html
> > > > > >
> > > > > >   "A zerocopy failure will return -1 with errno ENOBUFS. This happens
> > > > > >    if the socket option was not set, the socket exceeds its optmem
> > > > > >    limit or the user exceeds its ulimit on locked pages."
> > > > >
> > > > > You are correct, I unfortunately missed this part in the docs :(
> > > > >
> > > > > > The limit on locked pages is something that looks very likely to be
> > > > > > exceeded unless you happen to be running a QEMU config that already
> > > > > > implies locked memory (eg PCI assignment)
> > > > >
> > > > > Do you mean the limit an user has on locking memory?
> > > > >
> > > > > If so, that makes sense. I remember I needed to set the upper limit of locked
> > > > > memory for the user before using it, or adding a capability to qemu before.
> > > >
> > > > So I'm a bit confused on why MSG_ZEROCOPY requires checking RLIMIT_MEMLOCK.
> > > >
> > > > The thing is IIUC that's accounting for pinned pages only with either mlock()
> > > > (FOLL_MLOCK) or vfio (FOLL_PIN).
> > > >
> > > > I don't really think MSG_ZEROCOPY is doing that at all...  I'm looking at
> > > > __zerocopy_sg_from_iter() -> iov_iter_get_pages().
> > >
> > > It happens probably here:
> > >
> > > E.g
> > >
> > > __ip_append_data()
> > >     msg_zerocopy_realloc()
> > >         mm_account_pinned_pages()
> >
> > When do they get uncounted?  i.e. is it just the data that's in flight
> > that's marked as pinned?
>
> I think so - there's __msg_zerocopy_callback() -> mm_unaccount_pinned_pages()
> correspondingly.  Thanks,

Yes, and actually the memory that could be pinned is limited by the
sndbuf of the TCP socket. So we are fine with the rlimit (e.g. we don't
need to pin all guest pages).
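
So the per-socket bound to reason about is SO_SNDBUF rather than the whole
guest RAM; it can be queried with a few lines (standalone sketch):

    #include <sys/socket.h>

    /* Upper bound on data (and hence pinned pages) one TCP socket can
     * have in flight with MSG_ZEROCOPY: its send buffer size. */
    static long sndbuf_bytes(int fd)
    {
        int val = 0;
        socklen_t len = sizeof(val);

        if (getsockopt(fd, SOL_SOCKET, SO_SNDBUF, &val, &len) != 0) {
            return -1;
        }
        return val;   /* Linux reports double the requested value */
    }

Multiplying that by the number of multifd channels gives a rough worst
case to compare against the locked-pages limit.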

Thanks

>
> --
> Peter Xu
>
diff mbox series

Patch

diff --git a/migration/multifd-zlib.c b/migration/multifd-zlib.c
index ab4ba75d75..d8cce1810a 100644
--- a/migration/multifd-zlib.c
+++ b/migration/multifd-zlib.c
@@ -160,12 +160,13 @@  static int zlib_send_prepare(MultiFDSendParams *p, uint32_t used, Error **errp)
  * @used: number of pages used
  * @errp: pointer to an error
  */
-static int zlib_send_write(MultiFDSendParams *p, uint32_t used, Error **errp)
+static int zlib_send_write(MultiFDSendParams *p, uint32_t used, int flags,
+                           Error **errp)
 {
     struct zlib_data *z = p->data;
 
-    return qio_channel_write_all(p->c, (void *)z->zbuff, p->next_packet_size,
-                                 errp);
+    return qio_channel_write_all_flags(p->c, (void *)z->zbuff,
+                                       p->next_packet_size, flags, errp);
 }
 
 /**
diff --git a/migration/multifd-zstd.c b/migration/multifd-zstd.c
index 693bddf8c9..fa063fd33e 100644
--- a/migration/multifd-zstd.c
+++ b/migration/multifd-zstd.c
@@ -171,12 +171,13 @@  static int zstd_send_prepare(MultiFDSendParams *p, uint32_t used, Error **errp)
  * @used: number of pages used
  * @errp: pointer to an error
  */
-static int zstd_send_write(MultiFDSendParams *p, uint32_t used, Error **errp)
+static int zstd_send_write(MultiFDSendParams *p, uint32_t used, int flags,
+                           Error **errp)
 {
     struct zstd_data *z = p->data;
 
-    return qio_channel_write_all(p->c, (void *)z->zbuff, p->next_packet_size,
-                                 errp);
+    return qio_channel_write_all_flags(p->c, (void *)z->zbuff,
+                                       p->next_packet_size, flags, errp);
 }
 
 /**
diff --git a/migration/multifd.c b/migration/multifd.c
index 377da78f5b..097621c12c 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -103,9 +103,10 @@  static int nocomp_send_prepare(MultiFDSendParams *p, uint32_t used,
  * @used: number of pages used
  * @errp: pointer to an error
  */
-static int nocomp_send_write(MultiFDSendParams *p, uint32_t used, Error **errp)
+static int nocomp_send_write(MultiFDSendParams *p, uint32_t used, int flags,
+                             Error **errp)
 {
-    return qio_channel_writev_all(p->c, p->pages->iov, used, errp);
+    return qio_channel_writev_all_flags(p->c, p->pages->iov, used, flags, errp);
 }
 
 /**
@@ -675,7 +676,8 @@  static void *multifd_send_thread(void *opaque)
             }
 
             if (used) {
-                ret = multifd_send_state->ops->send_write(p, used, &local_err);
+                ret = multifd_send_state->ops->send_write(p, used, MSG_ZEROCOPY,
+                                                          &local_err);
                 if (ret != 0) {
                     break;
                 }
@@ -815,6 +817,7 @@  static bool multifd_channel_connect(MultiFDSendParams *p,
         } else {
             /* update for tls qio channel */
             p->c = ioc;
+            qio_channel_set_zerocopy(ioc, true);
             qemu_thread_create(&p->thread, p->name, multifd_send_thread, p,
                                    QEMU_THREAD_JOINABLE);
        }
diff --git a/migration/multifd.h b/migration/multifd.h
index 8d6751f5ed..7243ba4185 100644
--- a/migration/multifd.h
+++ b/migration/multifd.h
@@ -157,7 +157,8 @@  typedef struct {
     /* Prepare the send packet */
     int (*send_prepare)(MultiFDSendParams *p, uint32_t used, Error **errp);
     /* Write the send packet */
-    int (*send_write)(MultiFDSendParams *p, uint32_t used, Error **errp);
+    int (*send_write)(MultiFDSendParams *p, uint32_t used, int flags,
+                      Error **errp);
     /* Setup for receiving side */
     int (*recv_setup)(MultiFDRecvParams *p, Error **errp);
     /* Cleanup for receiving side */