
[v1,2/3] io: Add zerocopy and errqueue

Message ID 20210831110238.299458-3-leobras@redhat.com
State New
Series QIOChannel flags + multifd zerocopy

Commit Message

Leonardo Bras Aug. 31, 2021, 11:02 a.m. UTC
MSG_ZEROCOPY is a feature that enables copy avoidance in TCP/UDP socket
send calls. It does so by avoiding copying user data into kernel buffers.

To make it work, three steps are needed (see the sketch after this list):
1 - A setsockopt() system call, enabling SO_ZEROCOPY
2 - Passing down the MSG_ZEROCOPY flag for each send*() syscall
3 - Process the socket's error queue, dealing with any error
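
As a rough illustration of these three steps at the plain socket level (this
is not the QIOChannel code added by this patch; error handling and the real
completion bookkeeping are mostly left out, and it assumes Linux >= 4.14):

#include <errno.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <linux/errqueue.h>

static int zerocopy_send_example(int fd, struct iovec *iov, size_t niov)
{
    int one = 1;
    struct msghdr msg = { .msg_iov = iov, .msg_iovlen = niov };

    /* Step 1: enable SO_ZEROCOPY once per socket */
    if (setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one)) < 0) {
        return -1;   /* kernel too old: caller falls back to plain send */
    }

    /* Step 2: request copy avoidance for this particular send */
    if (sendmsg(fd, &msg, MSG_ZEROCOPY) < 0) {
        return -1;
    }

    /* Step 3: later, drain the error queue; error-queue reads never
     * block, they return EAGAIN when no notification is pending */
    for (;;) {
        char control[128];
        struct msghdr emsg = { .msg_control = control,
                               .msg_controllen = sizeof(control) };

        if (recvmsg(fd, &emsg, MSG_ERRQUEUE) < 0) {
            if (errno == EAGAIN) {
                break;
            }
            return -1;
        }

        struct cmsghdr *cm = CMSG_FIRSTHDR(&emsg);
        if (cm) {
            struct sock_extended_err *serr = (void *)CMSG_DATA(cm);
            /* serr->ee_origin == SO_EE_ORIGIN_ZEROCOPY; ee_info..ee_data
             * is the range of zerocopy sends the kernel has completed */
            (void)serr;
        }
    }
    return 0;
}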

Zerocopy has its costs, so it will only improve performance if
the sending buffer is big (around 10KB or more, according to the Linux docs).

Step 2 makes it possible to use the same socket to send data
using both zerocopy and the default copying approach, so the application
can choose what is best for each packet.

To implement step 1, an optional set_zerocopy() interface was created
in QIOChannel, allowing the calling code to enable or disable it.

Step 2 will be enabled by the calling code at each qio_channel_write*()
that would benefit from zerocopy.

Step 3 is done with qio_channel_socket_errq_proc(), which runs after
SOCKET_ERRQ_THRESH (16k) iovs have been sent, dealing with any error found.

Signed-off-by: Leonardo Bras <leobras@redhat.com>
---
 include/io/channel-socket.h |  2 +
 include/io/channel.h        | 29 ++++++++++++++
 io/channel-socket.c         | 76 +++++++++++++++++++++++++++++++++++++
 io/channel-tls.c            | 11 ++++++
 io/channel-websock.c        |  9 +++++
 io/channel.c                | 11 ++++++
 6 files changed, 138 insertions(+)

Comments

Daniel P. Berrangé Aug. 31, 2021, 12:57 p.m. UTC | #1
On Tue, Aug 31, 2021 at 08:02:38AM -0300, Leonardo Bras wrote:
> MSG_ZEROCOPY is a feature that enables copy avoidance in TCP/UDP socket
> send calls. It does so by avoiding copying user data into kernel buffers.
> 
> To make it work, three steps are needed:
> 1 - A setsockopt() system call, enabling SO_ZEROCOPY
> 2 - Passing down the MSG_ZEROCOPY flag for each send*() syscall
> 3 - Process the socket's error queue, dealing with any error

AFAICT, this is missing the single most critical aspect of MSG_ZEROCOPY.

It is non-obvious, but setting the MSG_ZEROCOPY flag turns sendmsg()
from a synchronous call to an asynchronous call.

It is forbidden to overwrite/reuse/free the buffer passed to sendmsg
until an asynchronous completion notification has been received from
the socket error queue. These notifications are not required to
arrive in-order, even for a TCP stream, because the kernel hangs on
to the buffer if a re-transmit is needed.

https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html

  "Page pinning also changes system call semantics. It temporarily 
   shares the buffer between process and network stack. Unlike with
   copying, the process cannot immediately overwrite the buffer 
   after system call return without possibly modifying the data in 
   flight. Kernel integrity is not affected, but a buggy program
   can possibly corrupt its own data stream."

AFAICT, the design added in this patch does not provide any way
to honour these requirements around buffer lifetime.

I can't see how we can introduce MSG_ZEROCOPY in any seamless
way. The buffer lifetime requirements imply the need for an API
design that is fundamentally different for asynchronous usage,
with a callback to notify when the write has finished/failed.
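
For the sake of discussion, the shape of such an API could be something like
this (the names below are invented to illustrate the idea; nothing like this
exists in the tree today):

/* Hypothetical sketch only, not a proposal of exact signatures. */
typedef void (*QIOChannelWriteDoneFunc)(QIOChannel *ioc,
                                        ssize_t ret,   /* bytes sent, or -1 */
                                        void *opaque);

/* Queues the write; the caller must not touch the iov[] buffers until
 * @done fires, which happens once the kernel reports completion on the
 * socket error queue (or the write fails). */
ssize_t qio_channel_writev_async(QIOChannel *ioc,
                                 const struct iovec *iov,
                                 size_t niov,
                                 QIOChannelWriteDoneFunc done,
                                 void *opaque,
                                 Error **errp);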

> Zerocopy has it's costs, so it will only get improved performance if
> the sending buffer is big (10KB, according to Linux docs).
> 
> The step 2 makes it possible to use the same socket to send data
> using both zerocopy and the default copying approach, so the application
> cat choose what is best for each packet.
> 
> To implement step 1, an optional set_zerocopy() interface was created
> in QIOChannel, allowing each using code to enable or disable it.
> 
> Step 2 will be enabled by the using code at each qio_channel_write*()
> that would benefit of zerocopy;
> 
> Step 3 is done with qio_channel_socket_errq_proc(), that runs after
> SOCKET_ERRQ_THRESH (16k) iovs sent, dealing with any error found.
> 
> Signed-off-by: Leonardo Bras <leobras@redhat.com>
> ---
>  include/io/channel-socket.h |  2 +
>  include/io/channel.h        | 29 ++++++++++++++
>  io/channel-socket.c         | 76 +++++++++++++++++++++++++++++++++++++
>  io/channel-tls.c            | 11 ++++++
>  io/channel-websock.c        |  9 +++++
>  io/channel.c                | 11 ++++++
>  6 files changed, 138 insertions(+)
> 
> diff --git a/include/io/channel-socket.h b/include/io/channel-socket.h
> index e747e63514..09dffe059f 100644
> --- a/include/io/channel-socket.h
> +++ b/include/io/channel-socket.h
> @@ -47,6 +47,8 @@ struct QIOChannelSocket {
>      socklen_t localAddrLen;
>      struct sockaddr_storage remoteAddr;
>      socklen_t remoteAddrLen;
> +    size_t errq_pending;
> +    bool zerocopy_enabled;
>  };
>  
>  
> diff --git a/include/io/channel.h b/include/io/channel.h
> index dada9ebaaf..de10a78b10 100644
> --- a/include/io/channel.h
> +++ b/include/io/channel.h
> @@ -137,6 +137,8 @@ struct QIOChannelClass {
>                                    IOHandler *io_read,
>                                    IOHandler *io_write,
>                                    void *opaque);
> +    void (*io_set_zerocopy)(QIOChannel *ioc,
> +                            bool enabled);
>  };
>  
>  /* General I/O handling functions */
> @@ -570,6 +572,33 @@ int qio_channel_shutdown(QIOChannel *ioc,
>  void qio_channel_set_delay(QIOChannel *ioc,
>                             bool enabled);
>  
> +/**
> + * qio_channel_set_zerocopy:
> + * @ioc: the channel object
> + * @enabled: the new flag state
> + *
> + * Controls whether the underlying transport is
> + * permitted to use zerocopy to avoid copying the
> + * sending buffer in kernel. If @enabled is true, then the
> + * writes may avoid buffer copy in kernel. If @enabled
> + * is false, writes will cause the kernel to always
> + * copy the buffer contents before sending.
> + *
> + * In order to use make a write with zerocopy feature,
> + * it's also necessary to sent each packet with
> + * MSG_ZEROCOPY flag. With this, it's possible to
> + * to select only writes that would benefit from the
> + * use of zerocopy feature, i.e. the ones with larger
> + * buffers.
> + *
> + * This feature was added in Linux 4.14, so older
> + * versions will fail on enabling. This is not an
> + * issue, since it will fall-back to default copying
> + * approach.
> + */
> +void qio_channel_set_zerocopy(QIOChannel *ioc,
> +                              bool enabled);
> +
>  /**
>   * qio_channel_set_cork:
>   * @ioc: the channel object
> diff --git a/io/channel-socket.c b/io/channel-socket.c
> index e377e7303d..a69fec7315 100644
> --- a/io/channel-socket.c
> +++ b/io/channel-socket.c
> @@ -26,8 +26,10 @@
>  #include "io/channel-watch.h"
>  #include "trace.h"
>  #include "qapi/clone-visitor.h"
> +#include <linux/errqueue.h>

That's going to fail to build on non-Linux.

>  
>  #define SOCKET_MAX_FDS 16
> +#define SOCKET_ERRQ_THRESH 16384
>  
>  SocketAddress *
>  qio_channel_socket_get_local_address(QIOChannelSocket *ioc,
> @@ -55,6 +57,8 @@ qio_channel_socket_new(void)
>  
>      sioc = QIO_CHANNEL_SOCKET(object_new(TYPE_QIO_CHANNEL_SOCKET));
>      sioc->fd = -1;
> +    sioc->zerocopy_enabled = false;
> +    sioc->errq_pending = 0;
>  
>      ioc = QIO_CHANNEL(sioc);
>      qio_channel_set_feature(ioc, QIO_CHANNEL_FEATURE_SHUTDOWN);
> @@ -520,6 +524,54 @@ static ssize_t qio_channel_socket_readv(QIOChannel *ioc,
>      return ret;
>  }
>  
> +static void qio_channel_socket_errq_proc(QIOChannelSocket *sioc,
> +                                         Error **errp)
> +{
> +    int fd = sioc->fd;
> +    int ret;
> +    struct msghdr msg = {};
> +    struct sock_extended_err *serr;
> +    struct cmsghdr *cm;
> +
> +    do {
> +        ret = recvmsg(fd, &msg, MSG_ERRQUEUE);
> +        if (ret <= 0) {
> +            if (ret == 0 || errno == EAGAIN) {
> +                /* Nothing on errqueue */
> +                 sioc->errq_pending = 0;
> +                 break;
> +            }
> +            if (errno == EINTR) {
> +                continue;
> +            }
> +
> +            error_setg_errno(errp, errno,
> +                             "Unable to read errqueue");
> +            break;
> +        }
> +
> +        cm = CMSG_FIRSTHDR(&msg);
> +        if (cm->cmsg_level != SOL_IP &&
> +            cm->cmsg_type != IP_RECVERR) {
> +            error_setg_errno(errp, EPROTOTYPE,
> +                             "Wrong cmsg in errqueue");
> +            break;
> +        }
> +
> +        serr = (void *) CMSG_DATA(cm);
> +        if (serr->ee_errno != 0) {
> +            error_setg_errno(errp, serr->ee_errno,
> +                             "Error on socket");
> +            break;
> +        }
> +        if (serr->ee_origin != SO_EE_ORIGIN_ZEROCOPY) {
> +            error_setg_errno(errp, serr->ee_origin,
> +                             "Error not from zerocopy");
> +            break;
> +        }
> +    } while (true);
> +}
> +
>  static ssize_t qio_channel_socket_writev(QIOChannel *ioc,
>                                           const struct iovec *iov,
>                                           size_t niov,
> @@ -571,6 +623,14 @@ static ssize_t qio_channel_socket_writev(QIOChannel *ioc,
>                           "Unable to write to socket");
>          return -1;
>      }
> +
> +    if ((flags & MSG_ZEROCOPY) && sioc->zerocopy_enabled) {
> +        sioc->errq_pending += niov;
> +        if (sioc->errq_pending > SOCKET_ERRQ_THRESH) {
> +            qio_channel_socket_errq_proc(sioc, errp);
> +        }
> +    }

This silently loses any errors set from up to the final
SOCKET_ERRQ_THRESH write() calls with MSG_ZEROCOPY set.

Also means if you have a series of writes without
MSG_ZEROCOPY, it'll delay checking any pending
errors.

I would suggest checking in close(), but as mentioned
earlier, I think the design is flawed because the caller
fundamentally needs to know about completion for every
single write they make in order to know when the buffer
can be released / reused.

> +static void
> +qio_channel_socket_set_zerocopy(QIOChannel *ioc,
> +                                bool enabled)
> +{
> +    QIOChannelSocket *sioc = QIO_CHANNEL_SOCKET(ioc);
> +    int v = enabled ? 1 : 0;
> +    int ret;
> +
> +    ret = qemu_setsockopt(sioc->fd, SOL_SOCKET, SO_ZEROCOPY, &v, sizeof(v));
> +    if (ret >= 0) {
> +        sioc->zerocopy_enabled = true;
> +    }
> +}

Surely we need to tell the caller whether this succeeded, otherwise
the later sendmsg() is going to fail with EINVAL on older kernels
where MSG_ZEROCOPY is not supported.


> diff --git a/io/channel-tls.c b/io/channel-tls.c
> index 4ce890a538..bf44b0f7b0 100644
> --- a/io/channel-tls.c
> +++ b/io/channel-tls.c
> @@ -350,6 +350,16 @@ static void qio_channel_tls_set_delay(QIOChannel *ioc,
>      qio_channel_set_delay(tioc->master, enabled);
>  }
>  
> +
> +static void qio_channel_tls_set_zerocopy(QIOChannel *ioc,
> +                                         bool enabled)
> +{
> +    QIOChannelTLS *tioc = QIO_CHANNEL_TLS(ioc);
> +
> +    qio_channel_set_zerocopy(tioc->master, enabled);
> +}

This is going to be unsafe because gnutls will discard/reuse the
buffer for the ciphertext after every write(). I can't see a
way to safely enable MSG_ZEROCOPY when TLS is used.


Regards,
Daniel
Peter Xu Aug. 31, 2021, 8:27 p.m. UTC | #2
On Tue, Aug 31, 2021 at 01:57:33PM +0100, Daniel P. Berrangé wrote:
> On Tue, Aug 31, 2021 at 08:02:38AM -0300, Leonardo Bras wrote:
> > MSG_ZEROCOPY is a feature that enables copy avoidance in TCP/UDP socket
> > send calls. It does so by avoiding copying user data into kernel buffers.
> > 
> > To make it work, three steps are needed:
> > 1 - A setsockopt() system call, enabling SO_ZEROCOPY
> > 2 - Passing down the MSG_ZEROCOPY flag for each send*() syscall
> > 3 - Process the socket's error queue, dealing with any error
> 
> AFAICT, this is missing the single most critical aspect of MSG_ZEROCOPY.
> 
> It is non-obvious, but setting the MSG_ZEROCOPY flag turns sendmsg()
> from a synchronous call to an asynchronous call.
> 
> It is forbidden to overwrite/reuse/free the buffer passed to sendmsg
> until an asynchronous completion notification has been received from
> the socket error queue. These notifications are not required to
> arrive in-order, even for a TCP stream, because the kernel hangs on
> to the buffer if a re-transmit is needed.
> 
> https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html
> 
>   "Page pinning also changes system call semantics. It temporarily 
>    shares the buffer between process and network stack. Unlike with
>    copying, the process cannot immediately overwrite the buffer 
>    after system call return without possibly modifying the data in 
>    flight. Kernel integrity is not affected, but a buggy program
>    can possibly corrupt its own data stream."
> 
> AFAICT, the design added in this patch does not provide any way
> to honour these requirements around buffer lifetime.
> 
> I can't see how we can introduce MSG_ZEROCOPY in any seemless
> way. The buffer lifetime requirements imply need for an API
> design that is fundamentally different for asynchronous usage,
> with a callback to notify when the write has finished/failed.

Regarding buffer reuse - it indeed has a very deep implication on the buffer
being available and it's not obvious at all.  Just to mention that the initial
user of this work will make sure all zero copy buffers will be guest pages only
(as it's only used in multi-fd), so they should always be there during the
process.

I think asking for a complete design still makes sense.  E.g., for precopy,
before we flush device states and complete the migration, we may want to at
least have a final ack on all the zero-copies of guest pages to guarantee they
are flushed.

IOW, we need to make sure the last piece of migration stream lands after the
guest pages so that the dest VM will always contain the latest page data when
dest VM starts.  So far I don't see how current code guaranteed that.

In short, we may just want to at least have a way to make sure all zero
copied buffers are finished being used and have been sent by the time some
function returns (e.g., qio_channel_flush()).  That may require us to do some
accounting on when we called sendmsg(MSG_ZEROCOPY); meanwhile we would need to
read out the ee_data field within the SO_EE_ORIGIN_ZEROCOPY msg when we do
recvmsg() for the error queue and keep that information somewhere too (rough
sketch below).
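
Roughly, the accounting could look like this (the struct and helper names are
made up here, and the real thing would live inside the channel code; 32-bit
wraparound of the kernel counter is ignored for simplicity):

#include <stdint.h>
#include <linux/errqueue.h>

/* Hypothetical bookkeeping: 'sent' is bumped on every successful
 * sendmsg(MSG_ZEROCOPY); 'acked' is advanced from the error queue. */
typedef struct {
    uint64_t sent;
    uint64_t acked;
} ZerocopyAccounting;

static void zerocopy_track_completion(ZerocopyAccounting *zc,
                                      const struct sock_extended_err *serr)
{
    /* For SO_EE_ORIGIN_ZEROCOPY with ee_errno == 0, the kernel is done
     * with the inclusive range of sends ee_info..ee_data. */
    if (serr->ee_origin == SO_EE_ORIGIN_ZEROCOPY && serr->ee_errno == 0) {
        zc->acked += (uint64_t)serr->ee_data - serr->ee_info + 1;
    }
}

/* A qio_channel_flush()-style call would then keep draining MSG_ERRQUEUE
 * until zc->acked == zc->sent before migration moves on. */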

Some other side notes that reached my mind..

The qio_channel_writev_full() may not be suitable for async operations, as the
name "full" implies synchronous to me.  So maybe we can add a new helper for
zero copy on the channel?

We may also want a new QIOChannelFeature as QIO_CHANNEL_FEATURE_ZEROCOPY, then
we fail qio_channel_writv_zerocopy() (or whatever name we come up with) if that
bit is not set in qio channel features.
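
Something along these lines, perhaps, where QIO_CHANNEL_FEATURE_ZEROCOPY, the
function name and the io_writev_zerocopy callback are all still hypothetical:

ssize_t qio_channel_writev_zerocopy(QIOChannel *ioc,
                                    const struct iovec *iov,
                                    size_t niov,
                                    Error **errp)
{
    if (!qio_channel_has_feature(ioc, QIO_CHANNEL_FEATURE_ZEROCOPY)) {
        error_setg(errp, "Channel does not support zerocopy writes");
        return -1;
    }
    /* hypothetical class callback implementing the zerocopy send */
    return QIO_CHANNEL_GET_CLASS(ioc)->io_writev_zerocopy(ioc, iov, niov, errp);
}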

Thanks,
Daniel P. Berrangé Sept. 1, 2021, 8:50 a.m. UTC | #3
On Tue, Aug 31, 2021 at 04:27:04PM -0400, Peter Xu wrote:
> On Tue, Aug 31, 2021 at 01:57:33PM +0100, Daniel P. Berrangé wrote:
> > On Tue, Aug 31, 2021 at 08:02:38AM -0300, Leonardo Bras wrote:
> > > MSG_ZEROCOPY is a feature that enables copy avoidance in TCP/UDP socket
> > > send calls. It does so by avoiding copying user data into kernel buffers.
> > > 
> > > To make it work, three steps are needed:
> > > 1 - A setsockopt() system call, enabling SO_ZEROCOPY
> > > 2 - Passing down the MSG_ZEROCOPY flag for each send*() syscall
> > > 3 - Process the socket's error queue, dealing with any error
> > 
> > AFAICT, this is missing the single most critical aspect of MSG_ZEROCOPY.
> > 
> > It is non-obvious, but setting the MSG_ZEROCOPY flag turns sendmsg()
> > from a synchronous call to an asynchronous call.
> > 
> > It is forbidden to overwrite/reuse/free the buffer passed to sendmsg
> > until an asynchronous completion notification has been received from
> > the socket error queue. These notifications are not required to
> > arrive in-order, even for a TCP stream, because the kernel hangs on
> > to the buffer if a re-transmit is needed.
> > 
> > https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html
> > 
> >   "Page pinning also changes system call semantics. It temporarily 
> >    shares the buffer between process and network stack. Unlike with
> >    copying, the process cannot immediately overwrite the buffer 
> >    after system call return without possibly modifying the data in 
> >    flight. Kernel integrity is not affected, but a buggy program
> >    can possibly corrupt its own data stream."
> > 
> > AFAICT, the design added in this patch does not provide any way
> > to honour these requirements around buffer lifetime.
> > 
> > I can't see how we can introduce MSG_ZEROCOPY in any seemless
> > way. The buffer lifetime requirements imply need for an API
> > design that is fundamentally different for asynchronous usage,
> > with a callback to notify when the write has finished/failed.
> 
> Regarding buffer reuse - it indeed has a very deep implication on the buffer
> being available and it's not obvious at all.  Just to mention that the initial
> user of this work will make sure all zero copy buffers will be guest pages only
> (as it's only used in multi-fd), so they should always be there during the
> process.

That is not the case when migration is using TLS, because the buffers
transmitted are the ciphertext buffer, not the plaintext guest page.

> In short, we may just want to at least having a way to make sure all zero
> copied buffers are finished using and they're sent after some function returns
> (e.g., qio_channel_flush()).  That may require us to do some accounting on when
> we called sendmsg(MSG_ZEROCOPY), meanwhile we should need to read out the
> ee_data field within SO_EE_ORIGIN_ZEROCOPY msg when we do recvmsg() for the
> error queue and keep those information somewhere too.
> 
> Some other side notes that reached my mind..
> 
> The qio_channel_writev_full() may not be suitable for async operations, as the
> name "full" implies synchronous to me.  So maybe we can add a new helper for
> zero copy on the channel?

All the APIs that exist today are fundamentally only suitable for sync
operations. Supporting async correctly will definitely need brand new APIs
separate from what exists today.

Regards,
Daniel
Peter Xu Sept. 1, 2021, 3:52 p.m. UTC | #4
On Wed, Sep 01, 2021 at 09:50:56AM +0100, Daniel P. Berrangé wrote:
> On Tue, Aug 31, 2021 at 04:27:04PM -0400, Peter Xu wrote:
> > On Tue, Aug 31, 2021 at 01:57:33PM +0100, Daniel P. Berrangé wrote:
> > > On Tue, Aug 31, 2021 at 08:02:38AM -0300, Leonardo Bras wrote:
> > > > MSG_ZEROCOPY is a feature that enables copy avoidance in TCP/UDP socket
> > > > send calls. It does so by avoiding copying user data into kernel buffers.
> > > > 
> > > > To make it work, three steps are needed:
> > > > 1 - A setsockopt() system call, enabling SO_ZEROCOPY
> > > > 2 - Passing down the MSG_ZEROCOPY flag for each send*() syscall
> > > > 3 - Process the socket's error queue, dealing with any error
> > > 
> > > AFAICT, this is missing the single most critical aspect of MSG_ZEROCOPY.
> > > 
> > > It is non-obvious, but setting the MSG_ZEROCOPY flag turns sendmsg()
> > > from a synchronous call to an asynchronous call.
> > > 
> > > It is forbidden to overwrite/reuse/free the buffer passed to sendmsg
> > > until an asynchronous completion notification has been received from
> > > the socket error queue. These notifications are not required to
> > > arrive in-order, even for a TCP stream, because the kernel hangs on
> > > to the buffer if a re-transmit is needed.
> > > 
> > > https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html
> > > 
> > >   "Page pinning also changes system call semantics. It temporarily 
> > >    shares the buffer between process and network stack. Unlike with
> > >    copying, the process cannot immediately overwrite the buffer 
> > >    after system call return without possibly modifying the data in 
> > >    flight. Kernel integrity is not affected, but a buggy program
> > >    can possibly corrupt its own data stream."
> > > 
> > > AFAICT, the design added in this patch does not provide any way
> > > to honour these requirements around buffer lifetime.
> > > 
> > > I can't see how we can introduce MSG_ZEROCOPY in any seemless
> > > way. The buffer lifetime requirements imply need for an API
> > > design that is fundamentally different for asynchronous usage,
> > > with a callback to notify when the write has finished/failed.
> > 
> > Regarding buffer reuse - it indeed has a very deep implication on the buffer
> > being available and it's not obvious at all.  Just to mention that the initial
> > user of this work will make sure all zero copy buffers will be guest pages only
> > (as it's only used in multi-fd), so they should always be there during the
> > process.
> 
> That is not the case when migration is using TLS, because the buffers
> transmitted are the ciphertext buffer, not the plaintext guest page.

Agreed.

> 
> > In short, we may just want to at least having a way to make sure all zero
> > copied buffers are finished using and they're sent after some function returns
> > (e.g., qio_channel_flush()).  That may require us to do some accounting on when
> > we called sendmsg(MSG_ZEROCOPY), meanwhile we should need to read out the
> > ee_data field within SO_EE_ORIGIN_ZEROCOPY msg when we do recvmsg() for the
> > error queue and keep those information somewhere too.
> > 
> > Some other side notes that reached my mind..
> > 
> > The qio_channel_writev_full() may not be suitable for async operations, as the
> > name "full" implies synchronous to me.  So maybe we can add a new helper for
> > zero copy on the channel?
> 
> All the APIs that exist today are fundamentally only suitable for sync
> operations. Supporting async correctly will definitely a brand new APIs
> separate from what exists today.

Yes.  What I wanted to say is maybe we can still keep the io_writev() interface
untouched, but just add a new interface at qio_channel_writev_full() level.

IOW, we could comment on io_writev() that it can be either sync or async now,
just like sendmsg() has that implication too with different flags passed in.
When calling io_writev() with different upper helpers, QIO channel could
identify what flag to pass over to io_writev().
Daniel P. Berrangé Sept. 1, 2021, 3:59 p.m. UTC | #5
On Wed, Sep 01, 2021 at 11:52:13AM -0400, Peter Xu wrote:
> On Wed, Sep 01, 2021 at 09:50:56AM +0100, Daniel P. Berrangé wrote:
> > On Tue, Aug 31, 2021 at 04:27:04PM -0400, Peter Xu wrote:
> > > On Tue, Aug 31, 2021 at 01:57:33PM +0100, Daniel P. Berrangé wrote:
> > > > On Tue, Aug 31, 2021 at 08:02:38AM -0300, Leonardo Bras wrote:
> > > > > MSG_ZEROCOPY is a feature that enables copy avoidance in TCP/UDP socket
> > > > > send calls. It does so by avoiding copying user data into kernel buffers.
> > > > > 
> > > > > To make it work, three steps are needed:
> > > > > 1 - A setsockopt() system call, enabling SO_ZEROCOPY
> > > > > 2 - Passing down the MSG_ZEROCOPY flag for each send*() syscall
> > > > > 3 - Process the socket's error queue, dealing with any error
> > > > 
> > > > AFAICT, this is missing the single most critical aspect of MSG_ZEROCOPY.
> > > > 
> > > > It is non-obvious, but setting the MSG_ZEROCOPY flag turns sendmsg()
> > > > from a synchronous call to an asynchronous call.
> > > > 
> > > > It is forbidden to overwrite/reuse/free the buffer passed to sendmsg
> > > > until an asynchronous completion notification has been received from
> > > > the socket error queue. These notifications are not required to
> > > > arrive in-order, even for a TCP stream, because the kernel hangs on
> > > > to the buffer if a re-transmit is needed.
> > > > 
> > > > https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html
> > > > 
> > > >   "Page pinning also changes system call semantics. It temporarily 
> > > >    shares the buffer between process and network stack. Unlike with
> > > >    copying, the process cannot immediately overwrite the buffer 
> > > >    after system call return without possibly modifying the data in 
> > > >    flight. Kernel integrity is not affected, but a buggy program
> > > >    can possibly corrupt its own data stream."
> > > > 
> > > > AFAICT, the design added in this patch does not provide any way
> > > > to honour these requirements around buffer lifetime.
> > > > 
> > > > I can't see how we can introduce MSG_ZEROCOPY in any seemless
> > > > way. The buffer lifetime requirements imply need for an API
> > > > design that is fundamentally different for asynchronous usage,
> > > > with a callback to notify when the write has finished/failed.
> > > 
> > > Regarding buffer reuse - it indeed has a very deep implication on the buffer
> > > being available and it's not obvious at all.  Just to mention that the initial
> > > user of this work will make sure all zero copy buffers will be guest pages only
> > > (as it's only used in multi-fd), so they should always be there during the
> > > process.
> > 
> > That is not the case when migration is using TLS, because the buffers
> > transmitted are the ciphertext buffer, not the plaintext guest page.
> 
> Agreed.
> 
> > 
> > > In short, we may just want to at least having a way to make sure all zero
> > > copied buffers are finished using and they're sent after some function returns
> > > (e.g., qio_channel_flush()).  That may require us to do some accounting on when
> > > we called sendmsg(MSG_ZEROCOPY), meanwhile we should need to read out the
> > > ee_data field within SO_EE_ORIGIN_ZEROCOPY msg when we do recvmsg() for the
> > > error queue and keep those information somewhere too.
> > > 
> > > Some other side notes that reached my mind..
> > > 
> > > The qio_channel_writev_full() may not be suitable for async operations, as the
> > > name "full" implies synchronous to me.  So maybe we can add a new helper for
> > > zero copy on the channel?
> > 
> > All the APIs that exist today are fundamentally only suitable for sync
> > operations. Supporting async correctly will definitely a brand new APIs
> > separate from what exists today.
> 
> Yes.  What I wanted to say is maybe we can still keep the io_writev() interface
> untouched, but just add a new interface at qio_channel_writev_full() level.
> 
> IOW, we could comment on io_writev() that it can be either sync or async now,
> just like sendmsg() has that implication too with different flag passed in.
> When calling io_writev() with different upper helpers, QIO channel could
> identify what flag to pass over to io_writev().

I don't think we should overload any of the existing methods with extra
parameters for async. Introduce a completely new set of methods exclusively
for this alternative interaction model.  I can kinda understand why they
took the approach they did with sendmsg, but I wouldn't use it as design
motivation for QEMU (except as an example of what not to do).

Regards,
Daniel
Leonardo Bras Sept. 2, 2021, 6:38 a.m. UTC | #6
Hello Daniel, thank you for the feedback!

Comments inline.

On Tue, Aug 31, 2021 at 9:57 AM Daniel P. Berrangé <berrange@redhat.com> wrote:
>
> On Tue, Aug 31, 2021 at 08:02:38AM -0300, Leonardo Bras wrote:
> > MSG_ZEROCOPY is a feature that enables copy avoidance in TCP/UDP socket
> > send calls. It does so by avoiding copying user data into kernel buffers.
> >
> > To make it work, three steps are needed:
> > 1 - A setsockopt() system call, enabling SO_ZEROCOPY
> > 2 - Passing down the MSG_ZEROCOPY flag for each send*() syscall
> > 3 - Process the socket's error queue, dealing with any error
>
> AFAICT, this is missing the single most critical aspect of MSG_ZEROCOPY.
>
> It is non-obvious, but setting the MSG_ZEROCOPY flag turns sendmsg()
> from a synchronous call to an asynchronous call.

You are correct. I tried to adapt the io_writev() interface to use MSG_ZEROCOPY
in a somewhat synchronous way, but returning errp (and sometimes closing the
channel because of it) was a poor implementation.

>
> It is forbidden to overwrite/reuse/free the buffer passed to sendmsg
> until an asynchronous completion notification has been received from
> the socket error queue. These notifications are not required to
> arrive in-order, even for a TCP stream, because the kernel hangs on
> to the buffer if a re-transmit is needed.
>
> https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html
>
>   "Page pinning also changes system call semantics. It temporarily
>    shares the buffer between process and network stack. Unlike with
>    copying, the process cannot immediately overwrite the buffer
>    after system call return without possibly modifying the data in
>    flight. Kernel integrity is not affected, but a buggy program
>    can possibly corrupt its own data stream."
>

From the above piece of documentation, I gather there is no problem in
overwriting the buffer, but a corrupt, or newer, version of the memory may
be sent instead of the original one. I am pointing this out because there
are workloads, like page migration, that would not be impacted, given that
once the buffer is changed, it will dirty the page and it will be re-sent.

But I agree with you.
It's not a good choice to expect all users to behave like that,
and since an interface for dealing with those errors is not provided
to the calling code, there is no way of using that in other scenarios.

> AFAICT, the design added in this patch does not provide any way
> to honour these requirements around buffer lifetime.
>
> I can't see how we can introduce MSG_ZEROCOPY in any seemless
> way. The buffer lifetime requirements imply need for an API
> design that is fundamentally different for asynchronous usage,
> with a callback to notify when the write has finished/failed.
>

That is a good point.
Proposing a new optional method like io_async_writev() + an error
checking mechanism could do the job.
io_async_writev() could fall back to io_writev() in cases where it's not
implemented (a rough sketch of this follows below).

I am not sure about the error checking yet.
Options I can see are:
1 - A callback, as you suggested, which IIUC would be provided by
code using the QIOChannel, and would only fix the reported errors,
leaving the responsibility of checking for errors to the IOChannel code.

2 - A new method, maybe io_async_errck(), which would be called
whenever the using code wants to deal with pending errors. It could
return an array/list of IOVs that need to be re-sent, for example,
and code using QIOChannel could deal with it however it wants.
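
Something like this is what I have in mind for the fallback part
(io_async_writev and the wrapper name are made up at this point; the
io_writev signature is the current one from QIOChannelClass):

ssize_t qio_channel_writev_maybe_async(QIOChannel *ioc,
                                       const struct iovec *iov,
                                       size_t niov,
                                       Error **errp)
{
    QIOChannelClass *klass = QIO_CHANNEL_GET_CLASS(ioc);

    if (klass->io_async_writev) {
        /* asynchronous path: the buffers remain in use until the
         * error-checking mechanism (callback or io_async_errck())
         * reports completion */
        return klass->io_async_writev(ioc, iov, niov, errp);
    }
    /* synchronous fallback: buffers may be reused on return */
    return klass->io_writev(ioc, iov, niov, NULL, 0, errp);
}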

[snip]

> >   * qio_channel_set_cork:
> >   * @ioc: the channel object
> > diff --git a/io/channel-socket.c b/io/channel-socket.c
> > index e377e7303d..a69fec7315 100644
> > --- a/io/channel-socket.c
> > +++ b/io/channel-socket.c
> > @@ -26,8 +26,10 @@
> >  #include "io/channel-watch.h"
> >  #include "trace.h"
> >  #include "qapi/clone-visitor.h"
> > +#include <linux/errqueue.h>
>
> That's going to fail to biuld on non-Linux

Good catch, thanks!

[snip]

> > @@ -571,6 +623,14 @@ static ssize_t qio_channel_socket_writev(QIOChannel *ioc,
> >                           "Unable to write to socket");
> >          return -1;
> >      }
> > +
> > +    if ((flags & MSG_ZEROCOPY) && sioc->zerocopy_enabled) {
> > +        sioc->errq_pending += niov;
> > +        if (sioc->errq_pending > SOCKET_ERRQ_THRESH) {
> > +            qio_channel_socket_errq_proc(sioc, errp);
> > +        }
> > +    }
>
> This silently looses any errors set from upto the final
> SOCKET_ERRQ_THRESH write() calls with MSG_ZEROCOPY set.

You are right.

>
> Also means if you have a series of writes without
> MSG_ZEROCOPY, it'll delay checking any pending
> errors.

That's expected... if only sends without MSG_ZEROCOPY are happening,
it means the ones sent with zerocopy can wait. The problem would be
the above case.

>
> I would suggest checkig in close(), but as mentioned
> earlier, I think the design is flawed because the caller
> fundamentally needs to know about completion for every
> single write they make in order to know when the buffer
> can be released / reused.

Well, there could be a flush mechanism (maybe in io_sync_errck(), activated
with a parameter flag, or on a different method if a callback is preferred):
In the MSG_ZEROCOPY docs, we can see that the example includes using a poll()
syscall after each packet sent, and this means the fd gets a signal after each
sendmsg() happens, with error or not.

We could harness this with a poll() and a relatively high timeout:
- We stop sending packets, and then call poll().
- Whenever poll() returns 0, it means a timeout happened, and so it took too
  long without sendmsg() happening, meaning all the packets are sent.
- If it returns anything else, we go back to fixing the errors found (re-send).

The problem may be defining the value of this timeout, but it could be called
only when zerocopy is active (rough sketch below).
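
Roughly, reusing qio_channel_socket_errq_proc() from this patch (the function
name and the 100ms timeout are arbitrary, just to show the idea):

static int qio_channel_socket_zerocopy_flush(QIOChannelSocket *sioc,
                                             Error **errp)
{
    /* events == 0 still reports POLLERR, which is how error-queue
     * readiness is signalled */
    struct pollfd pfd = { .fd = sioc->fd, .events = 0 };

    for (;;) {
        int n = poll(&pfd, 1, 100 /* ms, arbitrary */);

        if (n == 0) {
            return 0;    /* timed out: assume everything was sent */
        }
        if (n < 0) {
            error_setg_errno(errp, errno, "poll() on zerocopy socket");
            return -1;
        }
        if (pfd.revents & POLLERR) {
            Error *err = NULL;

            /* drain completions/errors with the patch's helper */
            qio_channel_socket_errq_proc(sioc, &err);
            if (err) {
                error_propagate(errp, err);
                return -1;
            }
        }
    }
}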

What do you think?


>
> > +static void
> > +qio_channel_socket_set_zerocopy(QIOChannel *ioc,
> > +                                bool enabled)
> > +{
> > +    QIOChannelSocket *sioc = QIO_CHANNEL_SOCKET(ioc);
> > +    int v = enabled ? 1 : 0;
> > +    int ret;
> > +
> > +    ret = qemu_setsockopt(sioc->fd, SOL_SOCKET, SO_ZEROCOPY, &v, sizeof(v));
> > +    if (ret >= 0) {
> > +        sioc->zerocopy_enabled = true;
> > +    }
> > +}
>
> Surely we need to tell the caller wether this succeeed, otherwise
> the later sendmsg() is going to fail with EINVAL on older kernels
> where MSG_ZEROCOPY is not supported.

Yeah, that was the idea behind sioc->zerocopy_enabled, but in fact it should
have been something like 'sioc->zerocopy_enabled = (ret >= 0) && enabled', so
it would reflect zerocopy only if it was enabled and the syscall worked, and
not-zerocopy otherwise.

Or something like this, if we want it to stick with zerocopy if setting it
off fails:

if (ret >= 0) {
    sioc->zerocopy_enabled = enabled;
}

>
>
> > diff --git a/io/channel-tls.c b/io/channel-tls.c
> > index 4ce890a538..bf44b0f7b0 100644
> > --- a/io/channel-tls.c
> > +++ b/io/channel-tls.c
> > @@ -350,6 +350,16 @@ static void qio_channel_tls_set_delay(QIOChannel *ioc,
> >      qio_channel_set_delay(tioc->master, enabled);
> >  }
> >
> > +
> > +static void qio_channel_tls_set_zerocopy(QIOChannel *ioc,
> > +                                         bool enabled)
> > +{
> > +    QIOChannelTLS *tioc = QIO_CHANNEL_TLS(ioc);
> > +
> > +    qio_channel_set_zerocopy(tioc->master, enabled);
> > +}
>
> This is going to be unsafe because gnutls will discard/reuse the
> buffer for the ciphertext after every write(). I can't see a
> way to safely enable MSG_ZEROCOPY when TLS is used.

Yeah, that makes sense.
It would make more sense to implement KTLS, as IIRC it kind of does
'zerocopy', since it saves the encrypted data directly in a kernel buffer.

We could implement KTLS as io_async_writev() for Channel_TLS, and change this
flag to async_enabled. If KTLS is not available, it would fall back to using
gnuTLS on io_writev, just like it would happen with zerocopy.

>
>
> Regards,
> Daniel
> --
> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
>
Leonardo Bras Sept. 2, 2021, 6:59 a.m. UTC | #7
Thanks for this feedback Peter!

I ended up reading/replying to the e-mails in thread order, so I may have
been redundant with your argument, sorry about that.

I will add my comments inline, but I will add references to the previous
mail I sent Daniel, so please read it too.

On Tue, Aug 31, 2021 at 5:27 PM Peter Xu <peterx@redhat.com> wrote:
[snip]
> >
> > I can't see how we can introduce MSG_ZEROCOPY in any seemless
> > way. The buffer lifetime requirements imply need for an API
> > design that is fundamentally different for asynchronous usage,
> > with a callback to notify when the write has finished/failed.
>
> Regarding buffer reuse - it indeed has a very deep implication on the buffer
> being available and it's not obvious at all.  Just to mention that the initial
> user of this work will make sure all zero copy buffers will be guest pages only
> (as it's only used in multi-fd), so they should always be there during the
> process.

Thanks for pointing that out, that's what I had in mind at the time.

>
> I think asking for a complete design still makes sense.

Agree, since I am touching QIOChannel, it makes sense to make it work for
other code that uses it too, not only our case.

>  E.g., for precopy
> before we flush device states and completes the migration, we may want to at
> least have a final ack on all the zero-copies of guest pages to guarantee they
> are flushed.
>
> IOW, we need to make sure the last piece of migration stream lands after the
> guest pages so that the dest VM will always contain the latest page data when
> dest VM starts.  So far I don't see how current code guaranteed that.
>
> In short, we may just want to at least having a way to make sure all zero
> copied buffers are finished using and they're sent after some function returns
> (e.g., qio_channel_flush()).  That may require us to do some accounting on when
> we called sendmsg(MSG_ZEROCOPY), meanwhile we should need to read out the
> ee_data field within SO_EE_ORIGIN_ZEROCOPY msg when we do recvmsg() for the
> error queue and keep those information somewhere too.

Yeah, that's correct.
I haven't fully studied what the returned data represents, but I suppose this
could be a way to fix that. In my previous reply to Daniel I pointed out a way
we may achieve a flush behavior with poll() too, but it could be a little hacky.

>
> Some other side notes that reached my mind..
>
> The qio_channel_writev_full() may not be suitable for async operations, as the
> name "full" implies synchronous to me.  So maybe we can add a new helper for
> zero copy on the channel?
>
> We may also want a new QIOChannelFeature as QIO_CHANNEL_FEATURE_ZEROCOPY, then
> we fail qio_channel_writv_zerocopy() (or whatever name we come up with) if that
> bit is not set in qio channel features.

I also suggested something like that, but I thought it could be good if we could
fall back to io_writev() if we didn't have the zerocopy feature (or
the async feature).

What do you think?

>
> Thanks,
>
> --
> Peter Xu
>

I really appreciate your suggestions, thanks Peter!

Best regards,
Leonardo
Leonardo Bras Sept. 2, 2021, 7:07 a.m. UTC | #8
Hello Daniel,
A few more comments:

On Wed, Sep 1, 2021 at 5:51 AM Daniel P. Berrangé <berrange@redhat.com> wrote:
>
> On Tue, Aug 31, 2021 at 04:27:04PM -0400, Peter Xu wrote:
> > On Tue, Aug 31, 2021 at 01:57:33PM +0100, Daniel P. Berrangé wrote:
> > > On Tue, Aug 31, 2021 at 08:02:38AM -0300, Leonardo Bras wrote:
> > > > MSG_ZEROCOPY is a feature that enables copy avoidance in TCP/UDP socket
> > > > send calls. It does so by avoiding copying user data into kernel buffers.
> > > >
> > > > To make it work, three steps are needed:
> > > > 1 - A setsockopt() system call, enabling SO_ZEROCOPY
> > > > 2 - Passing down the MSG_ZEROCOPY flag for each send*() syscall
> > > > 3 - Process the socket's error queue, dealing with any error
> > >
> > > AFAICT, this is missing the single most critical aspect of MSG_ZEROCOPY.
> > >
> > > It is non-obvious, but setting the MSG_ZEROCOPY flag turns sendmsg()
> > > from a synchronous call to an asynchronous call.
> > >
> > > It is forbidden to overwrite/reuse/free the buffer passed to sendmsg
> > > until an asynchronous completion notification has been received from
> > > the socket error queue. These notifications are not required to
> > > arrive in-order, even for a TCP stream, because the kernel hangs on
> > > to the buffer if a re-transmit is needed.
> > >
> > > https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html
> > >
> > >   "Page pinning also changes system call semantics. It temporarily
> > >    shares the buffer between process and network stack. Unlike with
> > >    copying, the process cannot immediately overwrite the buffer
> > >    after system call return without possibly modifying the data in
> > >    flight. Kernel integrity is not affected, but a buggy program
> > >    can possibly corrupt its own data stream."
> > >
> > > AFAICT, the design added in this patch does not provide any way
> > > to honour these requirements around buffer lifetime.
> > >
> > > I can't see how we can introduce MSG_ZEROCOPY in any seemless
> > > way. The buffer lifetime requirements imply need for an API
> > > design that is fundamentally different for asynchronous usage,
> > > with a callback to notify when the write has finished/failed.
> >
> > Regarding buffer reuse - it indeed has a very deep implication on the buffer
> > being available and it's not obvious at all.  Just to mention that the initial
> > user of this work will make sure all zero copy buffers will be guest pages only
> > (as it's only used in multi-fd), so they should always be there during the
> > process.
>
> That is not the case when migration is using TLS, because the buffers
> transmitted are the ciphertext buffer, not the plaintext guest page.

That makes sense.
I am still working my way through the migration code, and I ended up not
knowing that part.

I think we can work on KTLS for this case, and implement something
'like zerocopy' for this. I added more details in the previous mail I sent you.


>
> > In short, we may just want to at least having a way to make sure all zero
> > copied buffers are finished using and they're sent after some function returns
> > (e.g., qio_channel_flush()).  That may require us to do some accounting on when
> > we called sendmsg(MSG_ZEROCOPY), meanwhile we should need to read out the
> > ee_data field within SO_EE_ORIGIN_ZEROCOPY msg when we do recvmsg() for the
> > error queue and keep those information somewhere too.
> >
> > Some other side notes that reached my mind..
> >
> > The qio_channel_writev_full() may not be suitable for async operations, as the
> > name "full" implies synchronous to me.  So maybe we can add a new helper for
> > zero copy on the channel?
>
> All the APIs that exist today are fundamentally only suitable for sync
> operations. Supporting async correctly will definitely a brand new APIs
> separate from what exists today.

That's great, I was thinking of implementing this as an optional method for
QIOChannel; more details in the previous mail :).

(sorry, I ended up reading/replying to the mails in thread order)

>
> Regards,
> Daniel
> --
> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
>

Best regards,
Leonardo Bras
Daniel P. Berrangé Sept. 2, 2021, 8:47 a.m. UTC | #9
On Thu, Sep 02, 2021 at 03:38:11AM -0300, Leonardo Bras Soares Passos wrote:
> Hello Daniel, thank you for the feedback!
> 
> Comments inline.
> 
> On Tue, Aug 31, 2021 at 9:57 AM Daniel P. Berrangé <berrange@redhat.com> wrote:
> >
> > On Tue, Aug 31, 2021 at 08:02:38AM -0300, Leonardo Bras wrote:
> > > MSG_ZEROCOPY is a feature that enables copy avoidance in TCP/UDP socket
> > > send calls. It does so by avoiding copying user data into kernel buffers.
> > >
> > > To make it work, three steps are needed:
> > > 1 - A setsockopt() system call, enabling SO_ZEROCOPY
> > > 2 - Passing down the MSG_ZEROCOPY flag for each send*() syscall
> > > 3 - Process the socket's error queue, dealing with any error
> >
> > AFAICT, this is missing the single most critical aspect of MSG_ZEROCOPY.
> >
> > It is non-obvious, but setting the MSG_ZEROCOPY flag turns sendmsg()
> > from a synchronous call to an asynchronous call.
> 
> You are correct. I tried to adapt io_writev() interface to use MSG_ZEROCOPY in
> a somehow synchronous way, but returning errp (and sometimes closing the
> channel because of it) was a poor implementation.
> 
> >
> > It is forbidden to overwrite/reuse/free the buffer passed to sendmsg
> > until an asynchronous completion notification has been received from
> > the socket error queue. These notifications are not required to
> > arrive in-order, even for a TCP stream, because the kernel hangs on
> > to the buffer if a re-transmit is needed.
> >
> > https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html
> >
> >   "Page pinning also changes system call semantics. It temporarily
> >    shares the buffer between process and network stack. Unlike with
> >    copying, the process cannot immediately overwrite the buffer
> >    after system call return without possibly modifying the data in
> >    flight. Kernel integrity is not affected, but a buggy program
> >    can possibly corrupt its own data stream."
> >
> 
> By the above piece of documentation, I get there is no problem in
> overwriting the buffer, but a corrupt, or newer version of the memory may
> be sent instead of the original one. I am pointing this out because there
> are workloads like page migration that would not be impacted, given
> once the buffer is changed, it will dirty the page and it will be re-sent.

The idea of having the buffer overwritten is still seriously worrying
me. I get the idea that in theory the "worst" that should happen is
that we unexpectedly transmit newer guest memory contents. There are
so many edge cases in migration code though that I'm worried there
might be scenarios in which that is still going to cause problems.

The current migration code has a synchronization point at the end of
each iteration of the migration loop. At the end of each iteration,
the source has a guarantee that all pages from the dirty bitmap have
now been sent. With the ZEROCOPY feature we have entirely removed all
synchronization points from the source host wrt memory sending. At
best we know that the pages will have been sent if we get to close()
without seeing errors, unless we're going to explicitly track the
completion messages. The TCP re-transmit semantics are especially
worrying to me, because the re-transmit is liable to send different
data than was in the original lost TCP packet.

Maybe everything is still safe because TCP is a reliable ordered
stream, and thus eventually the dst will get the newest guest
page contents. I'm still very worried that we have code in the
source that implicitly relies on there being some synchronization
points that we've effectively removed.

> > AFAICT, the design added in this patch does not provide any way
> > to honour these requirements around buffer lifetime.
> >
> > I can't see how we can introduce MSG_ZEROCOPY in any seemless
> > way. The buffer lifetime requirements imply need for an API
> > design that is fundamentally different for asynchronous usage,
> > with a callback to notify when the write has finished/failed.
> 
> That is a good point.
> Proposing a new optional method like io_async_writev() + an error
> checking mechanism could do the job.
> io_async_writev() could fall-back to io_writev() in cases where it's not
> implemented.
> 
> I am not sure about the error checking yet.
> Options I can see are:
> 1 - A callback, as you suggested, which IIUC would be provided by
> code using the QIOChannel, and would only fix the reported errors,
> leaving the responsibility of checking for errors to the IOChannel code.

A callback is fairly simple, but potentially limits performance. Reading
the kernel docs it seems they intentionally merge notifications to enable
a whole contiguous set to be processed in one go. A callback would
effectively be unmerging them, requiring processing one at a time. Not
sure how much this matters to our usage.

> 2 - A new method, maybe io_async_errck(), which would be called
> whenever the using code wants to deal with pending errors. It could
> return an array/list of IOVs that need to be re-sent, for example,
> and code using QIOChannel could deal with it however it wants.

Considering that we're using TCP, we get a reliable, ordered data
stream. So there should never be a failure that requires us to
know specific IOVs to re-send - the kernel still handles TCP
re-transmit - we just have to still have the buffer available.
As noted above though, the re-transmit is liable to send different
data than the original transmit, which worries me.

So the only errors we should get from the completion notifications
would be errors that are completely fatal for the TCP stream. So
for errors, we only need to know whether an error has occurred or
not. The second reason for the completion notifications is for
providing synchronization, to know when the buffer can be released
or overwritten. That is the only scenario where we need to know the
list of IOVs that completed.



> > > @@ -571,6 +623,14 @@ static ssize_t qio_channel_socket_writev(QIOChannel *ioc,
> > >                           "Unable to write to socket");
> > >          return -1;
> > >      }
> > > +
> > > +    if ((flags & MSG_ZEROCOPY) && sioc->zerocopy_enabled) {
> > > +        sioc->errq_pending += niov;
> > > +        if (sioc->errq_pending > SOCKET_ERRQ_THRESH) {
> > > +            qio_channel_socket_errq_proc(sioc, errp);
> > > +        }
> > > +    }
> >
> > This silently looses any errors set from upto the final
> > SOCKET_ERRQ_THRESH write() calls with MSG_ZEROCOPY set.
> 
> You are right.
> 
> >
> > Also means if you have a series of writes without
> > MSG_ZEROCOPY, it'll delay checking any pending
> > errors.
> 
> That's expected... if there are only happening sends without MSG_ZEROCOPY,
> it means the ones sent with zerocopy can wait. The problem would be
> the above case.

Well it depends whether there are synchronization requirements in the
caller. For example, the current migration code knows that at the end of
each iteration of sending pages, the data has all actually been
sent. With this new logic, we might have done several more iterations
of the migration loop before the data for the original iteration has
been sent. The writes() for the original iteration may well be sending
the data from the later iterations. Possibly this is ok, but it is
a clear difference in the data that will actually go out on the wire
with this code.

> > I would suggest checkig in close(), but as mentioned
> > earlier, I think the design is flawed because the caller
> > fundamentally needs to know about completion for every
> > single write they make in order to know when the buffer
> > can be released / reused.
> 
> Well, there could be a flush mechanism (maybe in io_sync_errck(),
> activated with a
> parameter flag, or on a different method if callback is preferred):
> In the MSG_ZEROCOPY docs, we can see that the example includes using a poll()
> syscall after each packet sent, and this means the fd gets a signal after each
> sendmsg() happens, with error or not.
> 
> We could harness this with a poll() and a relatively high timeout:
> - We stop sending packets, and then call poll().
> - Whenever poll() returns 0, it means a timeout happened, and so it
> took too long
> without sendmsg() happening, meaning all the packets are sent.
> - If it returns anything else, we go back to fixing the errors found (re-send)
> 
> The problem may be defining the value of this timeout, but it could be
> called only
> when zerocopy is active.

Maybe we need to check completions at the end of each iteration of the
migration dirtypage loop ?


> > > +static void
> > > +qio_channel_socket_set_zerocopy(QIOChannel *ioc,
> > > +                                bool enabled)
> > > +{
> > > +    QIOChannelSocket *sioc = QIO_CHANNEL_SOCKET(ioc);
> > > +    int v = enabled ? 1 : 0;
> > > +    int ret;
> > > +
> > > +    ret = qemu_setsockopt(sioc->fd, SOL_SOCKET, SO_ZEROCOPY, &v, sizeof(v));
> > > +    if (ret >= 0) {
> > > +        sioc->zerocopy_enabled = true;
> > > +    }
> > > +}
> >
> > Surely we need to tell the caller wether this succeeed, otherwise
> > the later sendmsg() is going to fail with EINVAL on older kernels
> > where MSG_ZEROCOPY is not supported.
> 
> Yeah, that was the idea on sioc->zerocopy_enabled, but in fact it
> should have been
> something like 'sioc->zerocopy_enabled = (ret >= 0) && enabled ', this
> way it would
> reflect zerocopy only if it was enabled and the syscall worked, and not_zerocopy
> otherwise.
> 
> or something like this, if we want it to stick with zerocopy if
> setting it off fails.
> if (ret >= 0) {
>     sioc->zerocopy_enabled = enabled;
> }

Yes, that is a bug fix we need, but actually I was referring
to the later sendmsg() call. Surely we need to clear MSG_ZEROCOPY
from 'flags', if zerocopy_enabled is not set, to avoid EINVAL
from sendmsg.
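
i.e. something like this before building the sendmsg() call in
qio_channel_socket_writev() (exact placement is just a suggestion):

if (!sioc->zerocopy_enabled) {
    /* kernels without MSG_ZEROCOPY support would return EINVAL */
    flags &= ~MSG_ZEROCOPY;
}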

> > > diff --git a/io/channel-tls.c b/io/channel-tls.c
> > > index 4ce890a538..bf44b0f7b0 100644
> > > --- a/io/channel-tls.c
> > > +++ b/io/channel-tls.c
> > > @@ -350,6 +350,16 @@ static void qio_channel_tls_set_delay(QIOChannel *ioc,
> > >      qio_channel_set_delay(tioc->master, enabled);
> > >  }
> > >
> > > +
> > > +static void qio_channel_tls_set_zerocopy(QIOChannel *ioc,
> > > +                                         bool enabled)
> > > +{
> > > +    QIOChannelTLS *tioc = QIO_CHANNEL_TLS(ioc);
> > > +
> > > +    qio_channel_set_zerocopy(tioc->master, enabled);
> > > +}
> >
> > This is going to be unsafe because gnutls will discard/reuse the
> > buffer for the ciphertext after every write(). I can't see a
> > way to safely enable MSG_ZEROCOPY when TLS is used.
> 
> Yeah, that makes sense.
> It would make more sense to implement KTLS, as IIRC it kind of does
> 'zerocopy', since it saves the encrypted data directly in kernel buffer.

I would hope it is zerocopy, in so much as the kernel can directly
use the userspace pages as the plaintext, and then the ciphertext
in kernel space can be sent directly without further copies.

What I'm not sure about is whether this means it also becomes an effectively
asynchronous API for transmitting data, i.e. does it have the same
constraints about needing to lock pages, and not modify data in the
pages? I've not investigated it in any detail, but I don't recall
that being mentioned, and if it were the case, it would be impossible
to enable that transparently in gnutls as it's a major semantic change.

> We could implement KTLS as io_async_writev() for Channel_TLS, and change this
> flag to async_enabled. If KTLS is not available, it would fall back to
> using gnuTLS on io_writev, just like it would happen in zerocopy.

If gnutls is going to support KTLS in a way we can use, I don't want to
have any of that duplicated in QEMU code. I want to be able to delegate
as much as possible to gnutls, at least so that QEMU isn't in the loop
for any crypto sensitive parts, as that complicates certification
of crypto.

Regards,
Daniel
Leonardo Bras Sept. 2, 2021, 9:34 a.m. UTC | #10
On Thu, Sep 2, 2021 at 5:47 AM Daniel P. Berrangé <berrange@redhat.com> wrote:
>
> On Thu, Sep 02, 2021 at 03:38:11AM -0300, Leonardo Bras Soares Passos wrote:
> > Hello Daniel, thank you for the feedback!
> >
> > Comments inline.
> >
> > On Tue, Aug 31, 2021 at 9:57 AM Daniel P. Berrangé <berrange@redhat.com> wrote:
> > >
> > > On Tue, Aug 31, 2021 at 08:02:38AM -0300, Leonardo Bras wrote:
> > > > MSG_ZEROCOPY is a feature that enables copy avoidance in TCP/UDP socket
> > > > send calls. It does so by avoiding copying user data into kernel buffers.
> > > >
> > > > To make it work, three steps are needed:
> > > > 1 - A setsockopt() system call, enabling SO_ZEROCOPY
> > > > 2 - Passing down the MSG_ZEROCOPY flag for each send*() syscall
> > > > 3 - Process the socket's error queue, dealing with any error
> > >
> > > AFAICT, this is missing the single most critical aspect of MSG_ZEROCOPY.
> > >
> > > It is non-obvious, but setting the MSG_ZEROCOPY flag turns sendmsg()
> > > from a synchronous call to an asynchronous call.
> >
> > You are correct. I tried to adapt io_writev() interface to use MSG_ZEROCOPY in
> > a somehow synchronous way, but returning errp (and sometimes closing the
> > channel because of it) was a poor implementation.
> >
> > >
> > > It is forbidden to overwrite/reuse/free the buffer passed to sendmsg
> > > until an asynchronous completion notification has been received from
> > > the socket error queue. These notifications are not required to
> > > arrive in-order, even for a TCP stream, because the kernel hangs on
> > > to the buffer if a re-transmit is needed.
> > >
> > > https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html
> > >
> > >   "Page pinning also changes system call semantics. It temporarily
> > >    shares the buffer between process and network stack. Unlike with
> > >    copying, the process cannot immediately overwrite the buffer
> > >    after system call return without possibly modifying the data in
> > >    flight. Kernel integrity is not affected, but a buggy program
> > >    can possibly corrupt its own data stream."
> > >
> >
> > By the above piece of documentation, I get there is no problem in
> > overwriting the buffer, but a corrupt, or newer version of the memory may
> > be sent instead of the original one. I am pointing this out because there
> > are workloads like page migration that would not be impacted, given
> > once the buffer is changed, it will dirty the page and it will be re-sent.
>
> The idea of having the buffer overwritten is still seriously worrying
> me. I get the idea that in theory the "worst" that should happen is
> that we unexpectedly transmit newer guest memory contents. There are
> so many edge cases in migration code though that I'm worried there
> might be scenarios in which that is still going to cause problems.

I agree we should keep an eye on that; maybe we should get David Gilbert's
opinion on it.

>
> The current migration code has a synchronization point at the end of
> each iteration of the migration loop. At the end of each iteration,
> the source has a guarantee that all pages from the dirty bitmap have
> now been sent. With the ZEROCOPY feature we have entirely removed all
> synchronization points from the source host wrt memory sending. At
> best we know that the pages will have been sent if we get to close()
> without seeing errors, unless we're going to explicitly track the
> completion messages. The TCP re-transmit semantics are especially
> worrying to me, because the re-transmit is liable to send different
> data than was in the original lost TCP packet.
>
> Maybe everything is still safe because TCP is a reliable ordered
> stream, and thus eventually the dst will get the newest guest
> page contents. I'm still very worried that we have code in the
> source that implicitly relies on there being some synchronization
> points that we've effectively removed.
>
> > > AFAICT, the design added in this patch does not provide any way
> > > to honour these requirements around buffer lifetime.
> > >
> > > I can't see how we can introduce MSG_ZEROCOPY in any seemless
> > > way. The buffer lifetime requirements imply need for an API
> > > design that is fundamentally different for asynchronous usage,
> > > with a callback to notify when the write has finished/failed.
> >
> > That is a good point.
> > Proposing a new optional method like io_async_writev() + an error
> > checking mechanism could do the job.
> > io_async_writev() could fall-back to io_writev() in cases where it's not
> > implemented.
> >
> > I am not sure about the error checking yet.
> > Options I can see are:
> > 1 - A callback, as you suggested, which IIUC would be provided by
> > code using the QIOChannel, and would only fix the reported errors,
> > leaving the responsibility of checking for errors to the IOChannel code.
>
> A callback is fairly simple, but potentially limits performance. Reading
> the kernel docs it seems they intentionally merge notifications to enable
> a whole contiguous set to be processed in one go. A callback would
> effectively be unmerging them requiring processing one a time. Not
> sure how much this matters to our usage.
>
> > 2 - A new method, maybe io_async_errck(), which would be called
> > whenever the using code wants to deal with pending errors. It could
> > return an array/list of IOVs that need to be re-sent, for example,
> > and code using QIOChannel could deal with it however it wants.
>
> Considering that we're using TCP, we get a reliable, ordered data
> stream. So there should never be a failure that requires use to
> know specific IOVs to re-sent - the kernel still handles TCP
> re-transmit - we just have to still have the buffer available.
> As noted above though, the re-transmit is liable to send different
> data than the original transmit, which worries me.
>
> So the only errors we should get from the completion notifications
> would be errors that are completely fatal for the TCP stream. So
> for errors, we only need to know whether an error has ocurred or
> not. The second reason for the completion notifications is for
> providing a synchronization, to know when the buffer can be released
> or overwritten. That is the only scenario we need to know list of
> IOVs that completed.
>
>
>
> > > > @@ -571,6 +623,14 @@ static ssize_t qio_channel_socket_writev(QIOChannel *ioc,
> > > >                           "Unable to write to socket");
> > > >          return -1;
> > > >      }
> > > > +
> > > > +    if ((flags & MSG_ZEROCOPY) && sioc->zerocopy_enabled) {
> > > > +        sioc->errq_pending += niov;
> > > > +        if (sioc->errq_pending > SOCKET_ERRQ_THRESH) {
> > > > +            qio_channel_socket_errq_proc(sioc, errp);
> > > > +        }
> > > > +    }
> > >
> > > This silently looses any errors set from upto the final
> > > SOCKET_ERRQ_THRESH write() calls with MSG_ZEROCOPY set.
> >
> > You are right.
> >
> > >
> > > Also means if you have a series of writes without
> > > MSG_ZEROCOPY, it'll delay checking any pending
> > > errors.
> >
> > That's expected... if there are only happening sends without MSG_ZEROCOPY,
> > it means the ones sent with zerocopy can wait. The problem would be
> > the above case.
>
> Well it depends whether there are synchonization requirements in the
> caller. For example, current migration code knows that at the end of
> each iteration sending pages, that the data has all actally been
> sent. With this new logic, we might have done several more iterations
> of the migration loop before the data for the original iteration has
> been sent. The writes() for the original iteration may well be sending
> the data from the later iterations. Possibly this is ok, but it is
> a clear difference in the data that will actually go out on the wire
> with this code.
>
> > > I would suggest checkig in close(), but as mentioned
> > > earlier, I think the design is flawed because the caller
> > > fundamentally needs to know about completion for every
> > > single write they make in order to know when the buffer
> > > can be released / reused.
> >
> > Well, there could be a flush mechanism (maybe in io_sync_errck(),
> > activated with a
> > parameter flag, or on a different method if callback is preferred):
> > In the MSG_ZEROCOPY docs, we can see that the example includes using a poll()
> > syscall after each packet sent, and this means the fd gets a signal after each
> > sendmsg() happens, with error or not.
> >
> > We could harness this with a poll() and a relatively high timeout:
> > - We stop sending packets, and then call poll().
> > - Whenever poll() returns 0, it means a timeout happened, and so it
> > took too long
> > without sendmsg() happening, meaning all the packets are sent.
> > - If it returns anything else, we go back to fixing the errors found (re-send)
> >
> > The problem may be defining the value of this timeout, but it could be
> > called only
> > when zerocopy is active.
>
> Maybe we need to check completions at the end of each iteration of the
> migration dirtypage loop ?

Sorry, I am really new to this, and I still can't understand why we would
need to check at the end of each iteration, instead of doing a full check
at the end.

Is that for the case of a fatal error on the TCP stream, in which a lot of
packets were reported as 'sent' but we may lose track of which were in fact
sent?

Would the poll() approach above be enough for a 'flush()'?
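Something like the sketch below is what I have in mind (the function name and
the reuse of qio_channel_socket_errq_proc() are only assumptions for
illustration, not part of this patch):

static int qio_channel_socket_flush_zerocopy(QIOChannelSocket *sioc,
                                             int timeout_ms, Error **errp)
{
    /* POLLERR is reported even with events == 0 whenever the socket's
     * error queue has pending zerocopy completion notifications
     * (needs <poll.h>) */
    struct pollfd pfd = { .fd = sioc->fd, .events = 0 };
    Error *err = NULL;
    int ret;

    for (;;) {
        ret = poll(&pfd, 1, timeout_ms);
        if (ret == 0) {
            /* timeout: no sendmsg() completion arrived for a while,
             * assume everything already went out */
            return 0;
        }
        if (ret < 0) {
            if (errno == EINTR) {
                continue;
            }
            error_setg_errno(errp, errno,
                             "Unable to poll zerocopy errqueue");
            return -1;
        }
        /* drain whatever completions/errors are queued and loop again */
        qio_channel_socket_errq_proc(sioc, &err);
        if (err) {
            error_propagate(errp, err);
            return -1;
        }
    }
}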

>
>
> > > > +static void
> > > > +qio_channel_socket_set_zerocopy(QIOChannel *ioc,
> > > > +                                bool enabled)
> > > > +{
> > > > +    QIOChannelSocket *sioc = QIO_CHANNEL_SOCKET(ioc);
> > > > +    int v = enabled ? 1 : 0;
> > > > +    int ret;
> > > > +
> > > > +    ret = qemu_setsockopt(sioc->fd, SOL_SOCKET, SO_ZEROCOPY, &v, sizeof(v));
> > > > +    if (ret >= 0) {
> > > > +        sioc->zerocopy_enabled = true;
> > > > +    }
> > > > +}
> > >
> > > Surely we need to tell the caller wether this succeeed, otherwise
> > > the later sendmsg() is going to fail with EINVAL on older kernels
> > > where MSG_ZEROCOPY is not supported.
> >
> > Yeah, that was the idea on sioc->zerocopy_enabled, but in fact it
> > should have been
> > something like 'sioc->zerocopy_enabled = (ret >= 0) && enabled ', this
> > way it would
> > reflect zerocopy only if it was enabled and the syscall worked, and not_zerocopy
> > otherwise.
> >
> > or something like this, if we want it to stick with zerocopy if
> > setting it off fails.
> > if (ret >= 0) {
> >     sioc->zerocopy_enabled = enabled;
> > }
>
> Yes, that is a bug fix we need, but actually I was refering
> to the later sendmsg() call. Surely we need to clear MSG_ZEROCOPY
> from 'flags', if zerocopy_enabled is not set, to avoid EINVAL
> from sendmsg.

Agree, something like io_writev(,, sioc->zerocopy_enabled ? MSG_ZEROCOPY : 0, errp)
should do, right?
(or an io_async_writev() that falls back to io_writev() if
zerocopy is disabled)
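For the first option, something like this inside qio_channel_socket_writev()
(just a sketch of the masking, not a proposal for the final interface):

    /* never pass MSG_ZEROCOPY to sendmsg() unless SO_ZEROCOPY was
     * successfully enabled on this socket, to avoid EINVAL on older
     * kernels */
    if (!sioc->zerocopy_enabled) {
        flags &= ~MSG_ZEROCOPY;
    }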

>
> > > > diff --git a/io/channel-tls.c b/io/channel-tls.c
> > > > index 4ce890a538..bf44b0f7b0 100644
> > > > --- a/io/channel-tls.c
> > > > +++ b/io/channel-tls.c
> > > > @@ -350,6 +350,16 @@ static void qio_channel_tls_set_delay(QIOChannel *ioc,
> > > >      qio_channel_set_delay(tioc->master, enabled);
> > > >  }
> > > >
> > > > +
> > > > +static void qio_channel_tls_set_zerocopy(QIOChannel *ioc,
> > > > +                                         bool enabled)
> > > > +{
> > > > +    QIOChannelTLS *tioc = QIO_CHANNEL_TLS(ioc);
> > > > +
> > > > +    qio_channel_set_zerocopy(tioc->master, enabled);
> > > > +}
> > >
> > > This is going to be unsafe because gnutls will discard/reuse the
> > > buffer for the ciphertext after every write(). I can't see a
> > > way to safely enable MSG_ZEROCOPY when TLS is used.
> >
> > Yeah, that makes sense.
> > It would make more sense to implement KTLS, as IIRC it kind of does
> > 'zerocopy', since it saves the encrypted data directly in kernel buffer.
>
> I would hope it is zerocopy, in so much as the kernel can directly
> use the userspace pages as the plaintext, and then the ciphertext
> in kernel space can be sent directly without further copies.
>
> What i'm not sure is whether this means it also becomes an effectively
> asynchronous API for transmitting data. ie does it have the same
> constraints about needing to lock pages, and not modify data in the
> pages? I've not investigated it in any detail, but I don't recall
> that being mentioned, and if it was the case, it would be impossible
> to enable that transparently in gnutls as its a major semantic changed.

I think it somehow waits for the user buffer to be encrypted into the kernel
buffer, and then returns, behaving like a normal sendmsg().
(except if it's using the kernel's async crypto API, which I don't think is the case)

>
> > We could implement KTLS as io_async_writev() for Channel_TLS, and change this
> > flag to async_enabled. If KTLS is not available, it would fall back to
> > using gnuTLS on io_writev, just like it would happen in zerocopy.
>
> If gnutls is going to support KTLS in a way we can use, I dont want to
> have any of that duplicated in QEMU code. I want to be able delegate
> as much as possible to gnutls, at least so that QEMU isn't in the loop
> for any crypto sensitive parts, as that complicates certification
> of crypto.

Yeah, that's a very good argument :)
I think it probably makes sense to move from the current callback implementation
to an implementation in which we give gnuTLS the file descriptor.
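Roughly, instead of wiring gnuTLS to the channel with push/pull callbacks,
we'd hand it the socket fd directly; a sketch of the two styles (callback
names here are placeholders, only to illustrate the direction):

    /* current style: gnutls calls back into QEMU for all I/O, so it can
     * sit on top of any QIOChannel */
    gnutls_transport_set_ptr(session, ioc);
    gnutls_transport_set_push_function(session, tls_push_cb);
    gnutls_transport_set_pull_function(session, tls_pull_cb);

    /* fd-based style that a KTLS offload would want: give gnutls the
     * socket fd itself, which assumes the underlying channel is a socket */
    gnutls_transport_set_int(session, sioc->fd);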

(I was thinking of implementing a simpler direct gnuTLS version only
for the 'zerocopy' QIOChannelTLS, but I think it would not make much sense.)
Daniel P. Berrangé Sept. 2, 2021, 9:49 a.m. UTC | #11
On Thu, Sep 02, 2021 at 06:34:01AM -0300, Leonardo Bras Soares Passos wrote:
> On Thu, Sep 2, 2021 at 5:47 AM Daniel P. Berrangé <berrange@redhat.com> wrote:
> >
> > On Thu, Sep 02, 2021 at 03:38:11AM -0300, Leonardo Bras Soares Passos wrote:
> >
> > > > I would suggest checkig in close(), but as mentioned
> > > > earlier, I think the design is flawed because the caller
> > > > fundamentally needs to know about completion for every
> > > > single write they make in order to know when the buffer
> > > > can be released / reused.
> > >
> > > Well, there could be a flush mechanism (maybe in io_sync_errck(),
> > > activated with a
> > > parameter flag, or on a different method if callback is preferred):
> > > In the MSG_ZEROCOPY docs, we can see that the example includes using a poll()
> > > syscall after each packet sent, and this means the fd gets a signal after each
> > > sendmsg() happens, with error or not.
> > >
> > > We could harness this with a poll() and a relatively high timeout:
> > > - We stop sending packets, and then call poll().
> > > - Whenever poll() returns 0, it means a timeout happened, and so it
> > > took too long
> > > without sendmsg() happening, meaning all the packets are sent.
> > > - If it returns anything else, we go back to fixing the errors found (re-send)
> > >
> > > The problem may be defining the value of this timeout, but it could be
> > > called only
> > > when zerocopy is active.
> >
> > Maybe we need to check completions at the end of each iteration of the
> > migration dirtypage loop ?
> 
> Sorry, I am really new to this, and I still couldn't understand why would we
> need to check at the end of each iteration, instead of doing a full check at the
> end.

The end of each iteration is an implicit synchronization point in the
current migration code.

For example, we might do 2 iterations of migration pre-copy, and then
switch to post-copy mode. If the data from those 2 iterations hasn't
been sent at the point we switch to post-copy, that is a semantic
change from current behaviour. I don't know whether that will have a
problematic effect on the migration process or not. Checking the
async completions at the end of each iteration, though, would keep
the semantics similar to the current ones, reducing the risk of
unexpected problems.


> > > or something like this, if we want it to stick with zerocopy if
> > > setting it off fails.
> > > if (ret >= 0) {
> > >     sioc->zerocopy_enabled = enabled;
> > > }
> >
> > Yes, that is a bug fix we need, but actually I was refering
> > to the later sendmsg() call. Surely we need to clear MSG_ZEROCOPY
> > from 'flags', if zerocopy_enabled is not set, to avoid EINVAL
> > from sendmsg.
> 
> Agree, something like io_writev(,, sioc->zerocopy_enabled ?
> MSG_ZEROCOPY : 0, errp)'
> should do, right?
> (or an io_async_writev(), that will fall_back to io_writev() if
> zerocopy is disabled)

Something like that - it depends on what API we end up deciding on.

> > > We could implement KTLS as io_async_writev() for Channel_TLS, and change this
> > > flag to async_enabled. If KTLS is not available, it would fall back to
> > > using gnuTLS on io_writev, just like it would happen in zerocopy.
> >
> > If gnutls is going to support KTLS in a way we can use, I dont want to
> > have any of that duplicated in QEMU code. I want to be able delegate
> > as much as possible to gnutls, at least so that QEMU isn't in the loop
> > for any crypto sensitive parts, as that complicates certification
> > of crypto.
> 
> Yeah, that's a very good argument :)
> I think it's probably the case to move from the current callback implementation
> to the implementation in which we give gnuTLS the file descriptor.

That would introduce a coupling between the two QIOChannel impls
though, which has been avoided so far, because we didn't want an assumption
that a QIOChannel == an FD.


Regards,
Daniel
Leonardo Bras Sept. 2, 2021, 10:19 a.m. UTC | #12
On Thu, Sep 2, 2021 at 6:50 AM Daniel P. Berrangé <berrange@redhat.com> wrote:
>
> On Thu, Sep 02, 2021 at 06:34:01AM -0300, Leonardo Bras Soares Passos wrote:
> > On Thu, Sep 2, 2021 at 5:47 AM Daniel P. Berrangé <berrange@redhat.com> wrote:
> > >
> > > On Thu, Sep 02, 2021 at 03:38:11AM -0300, Leonardo Bras Soares Passos wrote:
> > >
> > > > > I would suggest checkig in close(), but as mentioned
> > > > > earlier, I think the design is flawed because the caller
> > > > > fundamentally needs to know about completion for every
> > > > > single write they make in order to know when the buffer
> > > > > can be released / reused.
> > > >
> > > > Well, there could be a flush mechanism (maybe in io_sync_errck(),
> > > > activated with a
> > > > parameter flag, or on a different method if callback is preferred):
> > > > In the MSG_ZEROCOPY docs, we can see that the example includes using a poll()
> > > > syscall after each packet sent, and this means the fd gets a signal after each
> > > > sendmsg() happens, with error or not.
> > > >
> > > > We could harness this with a poll() and a relatively high timeout:
> > > > - We stop sending packets, and then call poll().
> > > > - Whenever poll() returns 0, it means a timeout happened, and so it
> > > > took too long
> > > > without sendmsg() happening, meaning all the packets are sent.
> > > > - If it returns anything else, we go back to fixing the errors found (re-send)
> > > >
> > > > The problem may be defining the value of this timeout, but it could be
> > > > called only
> > > > when zerocopy is active.
> > >
> > > Maybe we need to check completions at the end of each iteration of the
> > > migration dirtypage loop ?
> >
> > Sorry, I am really new to this, and I still couldn't understand why would we
> > need to check at the end of each iteration, instead of doing a full check at the
> > end.
>
> The end of each iteration is an implicit synchronization point in the
> current migration code.
>
> For example, we might do 2 iterations of migration pre-copy, and then
> switch to post-copy mode. If the data from those 2 iterations hasn't
> been sent at the point we switch to post-copy, that is a semantic
> change from current behaviour. I don't know if that will have an
> problematic effect on the migration process, or not. Checking the
> async completions at the end of each iteration though, would ensure
> the semantics similar to current semantics, reducing risk of unexpected
> problems.
>

What if we do the 'flush()' before we start post-copy, instead of after each
iteration? Would that be enough?

>
> > > > or something like this, if we want it to stick with zerocopy if
> > > > setting it off fails.
> > > > if (ret >= 0) {
> > > >     sioc->zerocopy_enabled = enabled;
> > > > }
> > >
> > > Yes, that is a bug fix we need, but actually I was refering
> > > to the later sendmsg() call. Surely we need to clear MSG_ZEROCOPY
> > > from 'flags', if zerocopy_enabled is not set, to avoid EINVAL
> > > from sendmsg.
> >
> > Agree, something like io_writev(,, sioc->zerocopy_enabled ?
> > MSG_ZEROCOPY : 0, errp)'
> > should do, right?
> > (or an io_async_writev(), that will fall_back to io_writev() if
> > zerocopy is disabled)
>
> Something like that - depends what API we end up deciding on

Great :)

>
> > > > We could implement KTLS as io_async_writev() for Channel_TLS, and change this
> > > > flag to async_enabled. If KTLS is not available, it would fall back to
> > > > using gnuTLS on io_writev, just like it would happen in zerocopy.
> > >
> > > If gnutls is going to support KTLS in a way we can use, I dont want to
> > > have any of that duplicated in QEMU code. I want to be able delegate
> > > as much as possible to gnutls, at least so that QEMU isn't in the loop
> > > for any crypto sensitive parts, as that complicates certification
> > > of crypto.
> >
> > Yeah, that's a very good argument :)
> > I think it's probably the case to move from the current callback implementation
> > to the implementation in which we give gnuTLS the file descriptor.
>
> That would introduce a coupling  between the two QIOChannel impls
> though, which is avoided so far, because we didn't want an assuption
> that a QIOChannel == a FD.

Interesting, QIOChannelTLS would necessarily need to have a QIOChannelSocket,
is that right?

Maybe we can get a QIOChannelKTLS, which would use gnuTLS but get to
have its own fd, but that would possibly cause a lot of duplicated code.
On the other hand, this way the QIOChannelTLS would still be able to
accept any other channel, if desired.

Anyway, it's a complicated decision :)

> Regards,
> Daniel
> --
> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
>

Thank you, I am learning a lot today :)

Best regards,
Leonardo
Daniel P. Berrangé Sept. 2, 2021, 10:28 a.m. UTC | #13
On Thu, Sep 02, 2021 at 07:19:58AM -0300, Leonardo Bras Soares Passos wrote:
> On Thu, Sep 2, 2021 at 6:50 AM Daniel P. Berrangé <berrange@redhat.com> wrote:
> >
> > On Thu, Sep 02, 2021 at 06:34:01AM -0300, Leonardo Bras Soares Passos wrote:
> > > On Thu, Sep 2, 2021 at 5:47 AM Daniel P. Berrangé <berrange@redhat.com> wrote:
> > > >
> > > > On Thu, Sep 02, 2021 at 03:38:11AM -0300, Leonardo Bras Soares Passos wrote:
> > > >
> > > > > > I would suggest checkig in close(), but as mentioned
> > > > > > earlier, I think the design is flawed because the caller
> > > > > > fundamentally needs to know about completion for every
> > > > > > single write they make in order to know when the buffer
> > > > > > can be released / reused.
> > > > >
> > > > > Well, there could be a flush mechanism (maybe in io_sync_errck(),
> > > > > activated with a
> > > > > parameter flag, or on a different method if callback is preferred):
> > > > > In the MSG_ZEROCOPY docs, we can see that the example includes using a poll()
> > > > > syscall after each packet sent, and this means the fd gets a signal after each
> > > > > sendmsg() happens, with error or not.
> > > > >
> > > > > We could harness this with a poll() and a relatively high timeout:
> > > > > - We stop sending packets, and then call poll().
> > > > > - Whenever poll() returns 0, it means a timeout happened, and so it
> > > > > took too long
> > > > > without sendmsg() happening, meaning all the packets are sent.
> > > > > - If it returns anything else, we go back to fixing the errors found (re-send)
> > > > >
> > > > > The problem may be defining the value of this timeout, but it could be
> > > > > called only
> > > > > when zerocopy is active.
> > > >
> > > > Maybe we need to check completions at the end of each iteration of the
> > > > migration dirtypage loop ?
> > >
> > > Sorry, I am really new to this, and I still couldn't understand why would we
> > > need to check at the end of each iteration, instead of doing a full check at the
> > > end.
> >
> > The end of each iteration is an implicit synchronization point in the
> > current migration code.
> >
> > For example, we might do 2 iterations of migration pre-copy, and then
> > switch to post-copy mode. If the data from those 2 iterations hasn't
> > been sent at the point we switch to post-copy, that is a semantic
> > change from current behaviour. I don't know if that will have an
> > problematic effect on the migration process, or not. Checking the
> > async completions at the end of each iteration though, would ensure
> > the semantics similar to current semantics, reducing risk of unexpected
> > problems.
> >
> 
> What if we do the 'flush()' before we start post-copy, instead of after each
> iteration? would that be enough?

Possibly, yes. This really needs David G's input since he understands
the code in way more detail than me.


Regards,
Daniel
Dr. David Alan Gilbert Sept. 7, 2021, 11:06 a.m. UTC | #14
* Daniel P. Berrangé (berrange@redhat.com) wrote:
> On Thu, Sep 02, 2021 at 07:19:58AM -0300, Leonardo Bras Soares Passos wrote:
> > On Thu, Sep 2, 2021 at 6:50 AM Daniel P. Berrangé <berrange@redhat.com> wrote:
> > >
> > > On Thu, Sep 02, 2021 at 06:34:01AM -0300, Leonardo Bras Soares Passos wrote:
> > > > On Thu, Sep 2, 2021 at 5:47 AM Daniel P. Berrangé <berrange@redhat.com> wrote:
> > > > >
> > > > > On Thu, Sep 02, 2021 at 03:38:11AM -0300, Leonardo Bras Soares Passos wrote:
> > > > >
> > > > > > > I would suggest checkig in close(), but as mentioned
> > > > > > > earlier, I think the design is flawed because the caller
> > > > > > > fundamentally needs to know about completion for every
> > > > > > > single write they make in order to know when the buffer
> > > > > > > can be released / reused.
> > > > > >
> > > > > > Well, there could be a flush mechanism (maybe in io_sync_errck(),
> > > > > > activated with a
> > > > > > parameter flag, or on a different method if callback is preferred):
> > > > > > In the MSG_ZEROCOPY docs, we can see that the example includes using a poll()
> > > > > > syscall after each packet sent, and this means the fd gets a signal after each
> > > > > > sendmsg() happens, with error or not.
> > > > > >
> > > > > > We could harness this with a poll() and a relatively high timeout:
> > > > > > - We stop sending packets, and then call poll().
> > > > > > - Whenever poll() returns 0, it means a timeout happened, and so it
> > > > > > took too long
> > > > > > without sendmsg() happening, meaning all the packets are sent.
> > > > > > - If it returns anything else, we go back to fixing the errors found (re-send)
> > > > > >
> > > > > > The problem may be defining the value of this timeout, but it could be
> > > > > > called only
> > > > > > when zerocopy is active.
> > > > >
> > > > > Maybe we need to check completions at the end of each iteration of the
> > > > > migration dirtypage loop ?
> > > >
> > > > Sorry, I am really new to this, and I still couldn't understand why would we
> > > > need to check at the end of each iteration, instead of doing a full check at the
> > > > end.
> > >
> > > The end of each iteration is an implicit synchronization point in the
> > > current migration code.
> > >
> > > For example, we might do 2 iterations of migration pre-copy, and then
> > > switch to post-copy mode. If the data from those 2 iterations hasn't
> > > been sent at the point we switch to post-copy, that is a semantic
> > > change from current behaviour. I don't know if that will have an
> > > problematic effect on the migration process, or not. Checking the
> > > async completions at the end of each iteration though, would ensure
> > > the semantics similar to current semantics, reducing risk of unexpected
> > > problems.
> > >
> > 
> > What if we do the 'flush()' before we start post-copy, instead of after each
> > iteration? would that be enough?
> 
> Possibly, yes. This really need David G's input since he understands
> the code in way more detail than me.

Hmm I'm not entirely sure why we have the sync after each iteration;
the case I can think of is if we're doing async sending, we could have
two versions of the same page in flight (one from each iteration) -
you'd want those to get there in the right order.

Dave

> 
> Regards,
> Daniel
> -- 
> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
>
Peter Xu Sept. 7, 2021, 4:44 p.m. UTC | #15
On Thu, Sep 02, 2021 at 03:59:25AM -0300, Leonardo Bras Soares Passos wrote:
> I also suggested something like that, but I thought it could be good if we could
> fall back to io_writev() if we didn't have the zerocopy feature (or
> the async feature).
> 
> What do you think?

That fallback looks safe and ok, I'm just not sure whether it'll be of great
help.  E.g. if we provide a QIO API that allows both sync write and zero-copy
write (then we do the fallback when necessary), it means the buffer implication
applies to this API too, so it's easier for me to just detect the zero-copy
capability and use one alternative.  Thanks,
Peter Xu Sept. 7, 2021, 6:09 p.m. UTC | #16
On Tue, Sep 07, 2021 at 12:06:15PM +0100, Dr. David Alan Gilbert wrote:
> > > What if we do the 'flush()' before we start post-copy, instead of after each
> > > iteration? would that be enough?
> > 
> > Possibly, yes. This really need David G's input since he understands
> > the code in way more detail than me.
> 
> Hmm I'm not entirely sure why we have the sync after each iteration;

We don't have that yet, do we?

> the case I can think of is if we're doing async sending, we could have
> two versions of the same page in flight (one from each iteration) -
> you'd want those to get there in the right order.

From the MSG_ZEROCOPY documentation:

        For protocols that acknowledge data in-order, like TCP, each
        notification can be squashed into the previous one, so that no more
        than one notification is outstanding at any one point.

        Ordered delivery is the common case, but not guaranteed. Notifications
        may arrive out of order on retransmission and socket teardown.

So no matter whether we had a flush() per iteration before, it seems we may
want one when zero copy is enabled?
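Whatever does that flush would presumably walk the merged notifications,
something like the sketch below (the counter is hypothetical; the ee_* fields
come from struct sock_extended_err as described in the doc above):

        serr = (struct sock_extended_err *) CMSG_DATA(cm);
        if (serr->ee_errno == 0 &&
            serr->ee_origin == SO_EE_ORIGIN_ZEROCOPY) {
            /* one notification covers the completed sends numbered
             * ee_info..ee_data (inclusive); only after this point may
             * their buffers be reused */
            uint32_t lo = serr->ee_info;
            uint32_t hi = serr->ee_data;
            sioc->zerocopy_completed += hi - lo + 1;
        }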

Thanks,
Dr. David Alan Gilbert Sept. 8, 2021, 8:30 a.m. UTC | #17
* Peter Xu (peterx@redhat.com) wrote:
> On Tue, Sep 07, 2021 at 12:06:15PM +0100, Dr. David Alan Gilbert wrote:
> > > > What if we do the 'flush()' before we start post-copy, instead of after each
> > > > iteration? would that be enough?
> > > 
> > > Possibly, yes. This really need David G's input since he understands
> > > the code in way more detail than me.
> > 
> > Hmm I'm not entirely sure why we have the sync after each iteration;
> 
> We don't have that yet, do we?

I think multifd does; I think it calls multifd_send_sync_main sometimes,
which I *think* corresponds to iterations.

Dave

> > the case I can think of is if we're doing async sending, we could have
> > two versions of the same page in flight (one from each iteration) -
> > you'd want those to get there in the right order.
> 
> From MSG_ZEROCOPY document:
> 
>         For protocols that acknowledge data in-order, like TCP, each
>         notification can be squashed into the previous one, so that no more
>         than one notification is outstanding at any one point.
> 
>         Ordered delivery is the common case, but not guaranteed. Notifications
>         may arrive out of order on retransmission and socket teardown.
> 
> So no matter whether we have a flush() per iteration before, it seems we may
> want one when zero copy enabled?
> 
> Thanks,
> 
> -- 
> Peter Xu
>
Peter Xu Sept. 8, 2021, 3:24 p.m. UTC | #18
On Wed, Sep 08, 2021 at 09:30:58AM +0100, Dr. David Alan Gilbert wrote:
> * Peter Xu (peterx@redhat.com) wrote:
> > On Tue, Sep 07, 2021 at 12:06:15PM +0100, Dr. David Alan Gilbert wrote:
> > > > > What if we do the 'flush()' before we start post-copy, instead of after each
> > > > > iteration? would that be enough?
> > > > 
> > > > Possibly, yes. This really need David G's input since he understands
> > > > the code in way more detail than me.
> > > 
> > > Hmm I'm not entirely sure why we have the sync after each iteration;
> > 
> > We don't have that yet, do we?
> 
> I think multifd does; I think it calls multifd_send_sync_main sometimes,
> which I *think* corresponds to iterations.

Oh.. Then we may need to:

  (1) Make that sync work for zero-copy too; say, semaphores may not be enough
      with it - we need to make sure all async buffers are consumed by checking
      the error queue of the sockets,

  (2) When we make zero-copy work without multi-fd, we may want some similar
      sync on the main stream too

Thanks,

> 
> Dave
> 
> > > the case I can think of is if we're doing async sending, we could have
> > > two versions of the same page in flight (one from each iteration) -
> > > you'd want those to get there in the right order.
> > 
> > From MSG_ZEROCOPY document:
> > 
> >         For protocols that acknowledge data in-order, like TCP, each
> >         notification can be squashed into the previous one, so that no more
> >         than one notification is outstanding at any one point.
> > 
> >         Ordered delivery is the common case, but not guaranteed. Notifications
> >         may arrive out of order on retransmission and socket teardown.
> > 
> > So no matter whether we have a flush() per iteration before, it seems we may
> > want one when zero copy enabled?
> > 
> > Thanks,
> > 
> > -- 
> > Peter Xu
> > 
> -- 
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>
Leonardo Bras Sept. 8, 2021, 8:13 p.m. UTC | #19
On Tue, Sep 7, 2021 at 1:44 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Thu, Sep 02, 2021 at 03:59:25AM -0300, Leonardo Bras Soares Passos wrote:
> > I also suggested something like that, but I thought it could be good if we could
> > fall back to io_writev() if we didn't have the zerocopy feature (or
> > the async feature).
> >
> > What do you think?
>
> That fallback looks safe and ok, I'm just not sure whether it'll be of great
> help.  E.g. if we provide an QIO api that allows both sync write and zero-copy
> write (then we do the fallback when necessary), it means the buffer implication
> applies too to this api, so it's easier to me to just detect the zero copy
> capability and use one alternative.  Thanks,
>
> --
> Peter Xu
>

I was thinking that this way (async method with fallback) we would allow code
using the QIO API to just try io_async_writev() whenever it seems fit, let the
QIOChannel layer decide when it can actually be used (implementation provided,
features available), and just fall back to io_writev() when it can't be used.
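Something along these lines (just a sketch; io_async_writev is not an existing
QIOChannelClass callback, it's the new one being discussed here):

ssize_t qio_channel_async_writev(QIOChannel *ioc,
                                 const struct iovec *iov,
                                 size_t niov,
                                 Error **errp)
{
    QIOChannelClass *klass = QIO_CHANNEL_GET_CLASS(ioc);

    if (klass->io_async_writev) {
        /* zerocopy/async path provided by this channel implementation */
        return klass->io_async_writev(ioc, iov, niov, errp);
    }

    /* feature not available: fall back to the normal copying write */
    return klass->io_writev(ioc, iov, niov, NULL, 0, errp);
}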

Best regards,
Leo
Leonardo Bras Sept. 8, 2021, 8:25 p.m. UTC | #20
On Tue, Sep 7, 2021 at 8:06 AM Dr. David Alan Gilbert
<dgilbert@redhat.com> wrote:
> > Possibly, yes. This really need David G's input since he understands
> > the code in way more detail than me.
>
> Hmm I'm not entirely sure why we have the sync after each iteration;
> the case I can think of is if we're doing async sending, we could have
> two versions of the same page in flight (one from each iteration) -
> you'd want those to get there in the right order.
>
> Dave

Well, that's the thing: as we don't copy the buffer with MSG_ZEROCOPY, we will
in fact have the same page in flight twice, instead of two versions, given
that the buffer is sent as-is at transmission time.

To me it looks like there will be no change to the current algorithm, but I am
still a beginner in the migration code, and I am probably missing something.

Best regards,
Leo
Peter Xu Sept. 8, 2021, 9:04 p.m. UTC | #21
On Wed, Sep 08, 2021 at 05:13:40PM -0300, Leonardo Bras Soares Passos wrote:
> On Tue, Sep 7, 2021 at 1:44 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Thu, Sep 02, 2021 at 03:59:25AM -0300, Leonardo Bras Soares Passos wrote:
> > > I also suggested something like that, but I thought it could be good if we could
> > > fall back to io_writev() if we didn't have the zerocopy feature (or
> > > the async feature).
> > >
> > > What do you think?
> >
> > That fallback looks safe and ok, I'm just not sure whether it'll be of great
> > help.  E.g. if we provide an QIO api that allows both sync write and zero-copy
> > write (then we do the fallback when necessary), it means the buffer implication
> > applies too to this api, so it's easier to me to just detect the zero copy
> > capability and use one alternative.  Thanks,
> >
> > --
> > Peter Xu
> >
> 
> I was thinking this way (async method with fallback) we would allow code using
> the QIO api to just try to use the io_async_writev() whenever the code
> seems fit, and
> let the QIO channel layer to decide when it can be used
> (implementation provides,
> features available), and just fallback to io_writev() when it can't be used.

Sure, I think it depends on whether the whole sync/async interface will be the
same, e.g., if the async API needs some callback registered then the interface
won't be the same, and the caller will be aware of that anyway.  Otherwise it
looks fine indeed.  Thanks,
Peter Xu Sept. 8, 2021, 9:09 p.m. UTC | #22
On Wed, Sep 08, 2021 at 05:25:50PM -0300, Leonardo Bras Soares Passos wrote:
> On Tue, Sep 7, 2021 at 8:06 AM Dr. David Alan Gilbert
> <dgilbert@redhat.com> wrote:
> > > Possibly, yes. This really need David G's input since he understands
> > > the code in way more detail than me.
> >
> > Hmm I'm not entirely sure why we have the sync after each iteration;
> > the case I can think of is if we're doing async sending, we could have
> > two versions of the same page in flight (one from each iteration) -
> > you'd want those to get there in the right order.
> >
> > Dave
> 
> Well, that's the thing: as we don't copy the buffer in MSG_ZEROCOPY,  we will in
> fact have the same page in flight twice, instead of two versions,
> given the buffer is
> sent as it is during transmission.

That's an interesting point, which looks even valid... :)

There can still be two versions depending on when the page is read and fed to
the NIC, as the page can be changing during that window, but as long as the
later-sent page always lands later than the former page on the destination
node then it looks ok.
Daniel P. Berrangé Sept. 8, 2021, 9:57 p.m. UTC | #23
On Wed, Sep 08, 2021 at 05:09:33PM -0400, Peter Xu wrote:
> On Wed, Sep 08, 2021 at 05:25:50PM -0300, Leonardo Bras Soares Passos wrote:
> > On Tue, Sep 7, 2021 at 8:06 AM Dr. David Alan Gilbert
> > <dgilbert@redhat.com> wrote:
> > > > Possibly, yes. This really need David G's input since he understands
> > > > the code in way more detail than me.
> > >
> > > Hmm I'm not entirely sure why we have the sync after each iteration;
> > > the case I can think of is if we're doing async sending, we could have
> > > two versions of the same page in flight (one from each iteration) -
> > > you'd want those to get there in the right order.
> > >
> > > Dave
> > 
> > Well, that's the thing: as we don't copy the buffer in MSG_ZEROCOPY,  we will in
> > fact have the same page in flight twice, instead of two versions,
> > given the buffer is
> > sent as it is during transmission.
> 
> That's an interesting point, which looks even valid... :)
> 
> There can still be two versions depending on when the page is read and feed to
> the NICs as the page can be changing during the window, but as long as the
> latter sent page always lands later than the former page on dest node then it
> looks ok.

The really strange scenario is around TCP re-transmits. The kernel may
transmit page version 1, then we have version 2 written. The kernel now
needs to retransmit a packet due to network loss. The re-transmission will
contain the version 2 payload, which differs from the originally lost
packet's payload.

IOW, the supposedly "reliable" TCP stream is no longer reliable and actually
changes its data retrospectively, because we've intentionally violated the
rules the kernel documented for use of MSG_ZEROCOPY.

We think we're probably ok with migration as we are going to rely on the
fact that we eventually pause the guest to stop page changes during the
final switchover. Nonetheless I really strongly dislike the idea of
not honouring the kernel API contract, despite the potential performance
benefits it brings.

Regards,
Daniel
Peter Xu Sept. 9, 2021, 2:05 a.m. UTC | #24
On Wed, Sep 08, 2021 at 10:57:06PM +0100, Daniel P. Berrangé wrote:
> We think we're probably ok with migration as we are going to rely on the
> face that we eventually pause the guest to stop page changes during the
> final switchover. None the less I really strongly dislike the idea of
> not honouring the kernel API contract, despite the potential performance
> benefits it brings.

Yes, understandable, and it does look tricky. But it's guest memory, and to me
that's just by nature how it works (sending a page happens in parallel with
the page being modified).

I think the MSG_ZEROCOPY doc page mentions that, and it's the userspace
program's own choice if that happens. So even if it's not by design and not
suggested, it does not seem forbidden either.

We could wr-protect the pages (using things like userfaultfd-wp) while sending
and unprotect them when it's done, via a qio callback; that would guarantee
the buffer does not change during sending. However, we'd gain nothing besides
the "API cleanliness" then..

Thanks,
Leonardo Bras Sept. 9, 2021, 4:58 a.m. UTC | #25
On Wed, Sep 8, 2021 at 11:05 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Wed, Sep 08, 2021 at 10:57:06PM +0100, Daniel P. Berrangé wrote:
> > We think we're probably ok with migration as we are going to rely on the
> > face that we eventually pause the guest to stop page changes during the
> > final switchover. None the less I really strongly dislike the idea of
> > not honouring the kernel API contract, despite the potential performance
> > benefits it brings.
>
> Yes understandable, and it does looks tricky. But it's guest page and it's just
> by nature how it works to me (sending page happening in parallel with page
> being modified).
>
> I think the MSG_ZEROCOPY doc page mentioned that and it's userspace program's
> own choice if that happens. So even if it's not by design and not suggested, it
> seems not forbidden either.
>
> We can wr-protect the page (using things like userfaultfd-wp) during sending
> and unprotect them when it's done with a qio callback, that'll guarantee the
> buffer not changing during sending, however we gain nothing besides the "api
> cleaness" then..
>
> Thanks,
>
> --
> Peter Xu
>

I can't stress enough how inexperienced I am with the QEMU codebase, but based
on the current discussion and in my humble opinion, it would make sense if
something like an async_writev + async_flush method (or a callback instead)
were offered by QIOChannel, leaving the responsibility of "which flags to use",
"when to lock", and "how / when to flush" to the code using it.

I mean, it's clear that the sync requirements will be very different depending
on what the using code will be sending with it, so it makes sense to me that
we offer the tool and let it decide how it should be used (and possibly deal
with any consequences).

Is there anything that could go wrong with the above that I am missing?

About the proposed callback, I am not sure how it would work in an
efficient way. Could someone help me with that?

FWIW, what I had in mind for a (theoretical) migration setup with
io_async_writev() + io_async_flush() (a code sketch follows this list):
- For guest RAM we can decide not to rw_lock memory on zerocopy,
  because there is no need,
- For headers, we can decide not to use async (use io_writev() instead),
- flush() can happen on each iteration of migration, or every N
  seconds, or at the end.
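Rough sketch of that usage (all of the async function names here are
hypothetical; only qio_channel_writev_all() exists today):

    /* headers / metadata: small buffers, keep the plain synchronous write */
    qio_channel_writev_all(ioc, hdr_iov, hdr_niov, &err);

    /* guest RAM pages: try the zerocopy/async path; it falls back to the
     * copying io_writev() when the feature isn't available */
    qio_channel_async_writev(ioc, page_iov, page_niov, &err);

    /* at a synchronization point (end of an iteration, before switching to
     * postcopy, or at the very end): wait for pending completions */
    qio_channel_async_flush(ioc, &err);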

Thank you for the great discussion :)
Leonardo Bras
Dr. David Alan Gilbert Sept. 9, 2021, 8:49 a.m. UTC | #26
* Peter Xu (peterx@redhat.com) wrote:
> On Wed, Sep 08, 2021 at 09:30:58AM +0100, Dr. David Alan Gilbert wrote:
> > * Peter Xu (peterx@redhat.com) wrote:
> > > On Tue, Sep 07, 2021 at 12:06:15PM +0100, Dr. David Alan Gilbert wrote:
> > > > > > What if we do the 'flush()' before we start post-copy, instead of after each
> > > > > > iteration? would that be enough?
> > > > > 
> > > > > Possibly, yes. This really need David G's input since he understands
> > > > > the code in way more detail than me.
> > > > 
> > > > Hmm I'm not entirely sure why we have the sync after each iteration;
> > > 
> > > We don't have that yet, do we?
> > 
> > I think multifd does; I think it calls multifd_send_sync_main sometimes,
> > which I *think* corresponds to iterations.
> 
> Oh.. Then we may need to:
> 
>   (1) Make that sync work for zero-copy too; say, semaphores may not be enough
>       with it - we need to make sure all async buffers are consumed by checking
>       the error queue of the sockets,
> 
>   (2) When we make zero-copy work without multi-fd, we may want some similar
>       sync on the main stream too

It might not be worth bothering with (2) - zerocopy fits a lot better
with multifd's data organisation.

Dave

> Thanks,
> 
> > 
> > Dave
> > 
> > > > the case I can think of is if we're doing async sending, we could have
> > > > two versions of the same page in flight (one from each iteration) -
> > > > you'd want those to get there in the right order.
> > > 
> > > From MSG_ZEROCOPY document:
> > > 
> > >         For protocols that acknowledge data in-order, like TCP, each
> > >         notification can be squashed into the previous one, so that no more
> > >         than one notification is outstanding at any one point.
> > > 
> > >         Ordered delivery is the common case, but not guaranteed. Notifications
> > >         may arrive out of order on retransmission and socket teardown.
> > > 
> > > So no matter whether we have a flush() per iteration before, it seems we may
> > > want one when zero copy enabled?
> > > 
> > > Thanks,
> > > 
> > > -- 
> > > Peter Xu
> > > 
> > -- 
> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> > 
> 
> -- 
> Peter Xu
>
Peter Xu Sept. 9, 2021, 4:40 p.m. UTC | #27
On Thu, Sep 09, 2021 at 01:58:39AM -0300, Leonardo Bras Soares Passos wrote:
> FWIW, what I had in mind for a (theoretical) migration setup with
> io_async_writev() + io_async_flush():

One trivial concern is that it's not strictly just "async", because "async" can
happen on any nonblocking fd; here it's more "zerocopy" specific.  But as long
as Dan is okay with it I'm fine to start with "async" too, as long as we
properly document the buffer lifetime requirements.

> - For guest RAM we can decide not to rw_lock memory on zerocopy,
> because there is no need,
> - For headers, we can decide to not use async  (use io_writev() instead),
> - flush() can happen each iteration of migration, or at each N
> seconds, or at the end.

I think we should only need the flush at the end of precopy, and that should
also cover when switching to postcopy.  We could merge this flush() into the
existing per-iteration sync of multi-fd, but it's not strictly necessary, imho.
We can see which is easier. The rest looks good to me.  Thanks,

Patch

diff --git a/include/io/channel-socket.h b/include/io/channel-socket.h
index e747e63514..09dffe059f 100644
--- a/include/io/channel-socket.h
+++ b/include/io/channel-socket.h
@@ -47,6 +47,8 @@  struct QIOChannelSocket {
     socklen_t localAddrLen;
     struct sockaddr_storage remoteAddr;
     socklen_t remoteAddrLen;
+    size_t errq_pending;
+    bool zerocopy_enabled;
 };
 
 
diff --git a/include/io/channel.h b/include/io/channel.h
index dada9ebaaf..de10a78b10 100644
--- a/include/io/channel.h
+++ b/include/io/channel.h
@@ -137,6 +137,8 @@  struct QIOChannelClass {
                                   IOHandler *io_read,
                                   IOHandler *io_write,
                                   void *opaque);
+    void (*io_set_zerocopy)(QIOChannel *ioc,
+                            bool enabled);
 };
 
 /* General I/O handling functions */
@@ -570,6 +572,33 @@  int qio_channel_shutdown(QIOChannel *ioc,
 void qio_channel_set_delay(QIOChannel *ioc,
                            bool enabled);
 
+/**
+ * qio_channel_set_zerocopy:
+ * @ioc: the channel object
+ * @enabled: the new flag state
+ *
+ * Controls whether the underlying transport is
+ * permitted to use zerocopy to avoid copying the
+ * sending buffer in kernel. If @enabled is true, then the
+ * writes may avoid buffer copy in kernel. If @enabled
+ * is false, writes will cause the kernel to always
+ * copy the buffer contents before sending.
+ *
+ * In order to make a write with the zerocopy feature,
+ * it's also necessary to send each packet with the
+ * MSG_ZEROCOPY flag. With this, it's possible to
+ * select only the writes that would benefit from the
+ * use of the zerocopy feature, i.e. the ones with larger
+ * buffers.
+ *
+ * This feature was added in Linux 4.14, so older
+ * versions will fail on enabling. This is not an
+ * issue, since it will fall back to the default
+ * copying approach.
+ */
+void qio_channel_set_zerocopy(QIOChannel *ioc,
+                              bool enabled);
+
 /**
  * qio_channel_set_cork:
  * @ioc: the channel object
diff --git a/io/channel-socket.c b/io/channel-socket.c
index e377e7303d..a69fec7315 100644
--- a/io/channel-socket.c
+++ b/io/channel-socket.c
@@ -26,8 +26,10 @@ 
 #include "io/channel-watch.h"
 #include "trace.h"
 #include "qapi/clone-visitor.h"
+#include <linux/errqueue.h>
 
 #define SOCKET_MAX_FDS 16
+#define SOCKET_ERRQ_THRESH 16384
 
 SocketAddress *
 qio_channel_socket_get_local_address(QIOChannelSocket *ioc,
@@ -55,6 +57,8 @@  qio_channel_socket_new(void)
 
     sioc = QIO_CHANNEL_SOCKET(object_new(TYPE_QIO_CHANNEL_SOCKET));
     sioc->fd = -1;
+    sioc->zerocopy_enabled = false;
+    sioc->errq_pending = 0;
 
     ioc = QIO_CHANNEL(sioc);
     qio_channel_set_feature(ioc, QIO_CHANNEL_FEATURE_SHUTDOWN);
@@ -520,6 +524,54 @@  static ssize_t qio_channel_socket_readv(QIOChannel *ioc,
     return ret;
 }
 
+static void qio_channel_socket_errq_proc(QIOChannelSocket *sioc,
+                                         Error **errp)
+{
+    int fd = sioc->fd;
+    int ret;
+    struct msghdr msg = {};
+    struct sock_extended_err *serr;
+    struct cmsghdr *cm;
+
+    do {
+        ret = recvmsg(fd, &msg, MSG_ERRQUEUE);
+        if (ret <= 0) {
+            if (ret == 0 || errno == EAGAIN) {
+                /* Nothing on errqueue */
+                 sioc->errq_pending = 0;
+                 break;
+            }
+            if (errno == EINTR) {
+                continue;
+            }
+
+            error_setg_errno(errp, errno,
+                             "Unable to read errqueue");
+            break;
+        }
+
+        cm = CMSG_FIRSTHDR(&msg);
+        if (cm->cmsg_level != SOL_IP &&
+            cm->cmsg_type != IP_RECVERR) {
+            error_setg_errno(errp, EPROTOTYPE,
+                             "Wrong cmsg in errqueue");
+            break;
+        }
+
+        serr = (void *) CMSG_DATA(cm);
+        if (serr->ee_errno != 0) {
+            error_setg_errno(errp, serr->ee_errno,
+                             "Error on socket");
+            break;
+        }
+        if (serr->ee_origin != SO_EE_ORIGIN_ZEROCOPY) {
+            error_setg_errno(errp, serr->ee_origin,
+                             "Error not from zerocopy");
+            break;
+        }
+    } while (true);
+}
+
 static ssize_t qio_channel_socket_writev(QIOChannel *ioc,
                                          const struct iovec *iov,
                                          size_t niov,
@@ -571,6 +623,14 @@  static ssize_t qio_channel_socket_writev(QIOChannel *ioc,
                          "Unable to write to socket");
         return -1;
     }
+
+    if ((flags & MSG_ZEROCOPY) && sioc->zerocopy_enabled) {
+        sioc->errq_pending += niov;
+        if (sioc->errq_pending > SOCKET_ERRQ_THRESH) {
+            qio_channel_socket_errq_proc(sioc, errp);
+        }
+    }
+
     return ret;
 }
 #else /* WIN32 */
@@ -689,6 +749,21 @@  qio_channel_socket_set_delay(QIOChannel *ioc,
 }
 
 
+static void
+qio_channel_socket_set_zerocopy(QIOChannel *ioc,
+                                bool enabled)
+{
+    QIOChannelSocket *sioc = QIO_CHANNEL_SOCKET(ioc);
+    int v = enabled ? 1 : 0;
+    int ret;
+
+    ret = qemu_setsockopt(sioc->fd, SOL_SOCKET, SO_ZEROCOPY, &v, sizeof(v));
+    if (ret >= 0) {
+        sioc->zerocopy_enabled = true;
+    }
+}
+
+
 static void
 qio_channel_socket_set_cork(QIOChannel *ioc,
                             bool enabled)
@@ -789,6 +864,7 @@  static void qio_channel_socket_class_init(ObjectClass *klass,
     ioc_klass->io_set_delay = qio_channel_socket_set_delay;
     ioc_klass->io_create_watch = qio_channel_socket_create_watch;
     ioc_klass->io_set_aio_fd_handler = qio_channel_socket_set_aio_fd_handler;
+    ioc_klass->io_set_zerocopy = qio_channel_socket_set_zerocopy;
 }
 
 static const TypeInfo qio_channel_socket_info = {
diff --git a/io/channel-tls.c b/io/channel-tls.c
index 4ce890a538..bf44b0f7b0 100644
--- a/io/channel-tls.c
+++ b/io/channel-tls.c
@@ -350,6 +350,16 @@  static void qio_channel_tls_set_delay(QIOChannel *ioc,
     qio_channel_set_delay(tioc->master, enabled);
 }
 
+
+static void qio_channel_tls_set_zerocopy(QIOChannel *ioc,
+                                         bool enabled)
+{
+    QIOChannelTLS *tioc = QIO_CHANNEL_TLS(ioc);
+
+    qio_channel_set_zerocopy(tioc->master, enabled);
+}
+
+
 static void qio_channel_tls_set_cork(QIOChannel *ioc,
                                      bool enabled)
 {
@@ -416,6 +426,7 @@  static void qio_channel_tls_class_init(ObjectClass *klass,
     ioc_klass->io_shutdown = qio_channel_tls_shutdown;
     ioc_klass->io_create_watch = qio_channel_tls_create_watch;
     ioc_klass->io_set_aio_fd_handler = qio_channel_tls_set_aio_fd_handler;
+    ioc_klass->io_set_zerocopy = qio_channel_tls_set_zerocopy;
 }
 
 static const TypeInfo qio_channel_tls_info = {
diff --git a/io/channel-websock.c b/io/channel-websock.c
index 035dd6075b..4e9491966b 100644
--- a/io/channel-websock.c
+++ b/io/channel-websock.c
@@ -1194,6 +1194,14 @@  static void qio_channel_websock_set_delay(QIOChannel *ioc,
     qio_channel_set_delay(tioc->master, enabled);
 }
 
+static void qio_channel_websock_set_zerocopy(QIOChannel *ioc,
+                                             bool enabled)
+{
+    QIOChannelWebsock *tioc = QIO_CHANNEL_WEBSOCK(ioc);
+
+    qio_channel_set_zerocopy(tioc->master, enabled);
+}
+
 static void qio_channel_websock_set_cork(QIOChannel *ioc,
                                          bool enabled)
 {
@@ -1318,6 +1326,7 @@  static void qio_channel_websock_class_init(ObjectClass *klass,
     ioc_klass->io_close = qio_channel_websock_close;
     ioc_klass->io_shutdown = qio_channel_websock_shutdown;
     ioc_klass->io_create_watch = qio_channel_websock_create_watch;
+    ioc_klass->io_set_zerocopy = qio_channel_websock_set_zerocopy;
 }
 
 static const TypeInfo qio_channel_websock_info = {
diff --git a/io/channel.c b/io/channel.c
index ee3cb83d4d..476440e8b2 100644
--- a/io/channel.c
+++ b/io/channel.c
@@ -450,6 +450,17 @@  void qio_channel_set_delay(QIOChannel *ioc,
 }
 
 
+void qio_channel_set_zerocopy(QIOChannel *ioc,
+                              bool enabled)
+{
+    QIOChannelClass *klass = QIO_CHANNEL_GET_CLASS(ioc);
+
+    if (klass->io_set_zerocopy) {
+        klass->io_set_zerocopy(ioc, enabled);
+    }
+}
+
+
 void qio_channel_set_cork(QIOChannel *ioc,
                           bool enabled)
 {