[v2] 9pfs: Correctly handle cancelled requests

Message ID 20171204203619.GA16349@juliacomputing.com
State New
Headers show
Series
  • [v2] 9pfs: Correctly handle cancelled requests
Related show

Commit Message

Keno Fischer Dec. 4, 2017, 8:36 p.m.
# Background

I was investigating spurious non-deterministic EINTR returns from
various 9p file system operations in a Linux guest served from the
qemu 9p server.

 ## EINTR, ERESTARTSYS and the linux kernel

When a signal arrives that the Linux kernel needs to deliver to user-space
while a given thread is blocked (in the 9p case waiting for a reply to its
request in 9p_client_rpc -> wait_event_interruptible), it asks whatever
driver is currently running to abort its current operation (in the 9p case
causing the submission of a TFLUSH message) and return to user space.
In these situations, the error message reported is generally ERESTARTSYS.
If the userspace processes specified SA_RESTART, this means that the
system call will get restarted upon completion of the signal handler
delivery (assuming the signal handler doesn't modify the process state
in complicated ways not relevant here). If SA_RESTART is not specified,
ERESTARTSYS gets translated to EINTR and user space is expected to handle
the restart itself.

 ## The 9p TFLISH command

The 9p TFLUSH commands requests that the server abort an ongoing operation.
The man page [1] specifies:

```
If it recognizes oldtag as the tag of a pending transaction, it should abort any
pending response and discard that tag.
[...]
When the client sends a Tflush, it must wait to receive the corresponding Rflush
before reusing oldtag for subsequent messages. If a response to the flushed request
is received before the Rflush, the client must honor the response as if it had not
been flushed, since the completed request may signify a state change in the server
```

In particular, this means that the server must not send a reply with the orignal
tag in response to the cancellation request, because the client is obligated
to interpret such a reply as a coincidental reply to the original request.

 # The bug

When qemu receives a TFlush request, it sets the `cancelled` flag on the relevant
pdu. This flag is periodically checked, e.g. in `v9fs_co_name_to_path`, and if
set, the operation is aborted and the error is set to EINTR. However, the server
then violates the spec, by returning to the client an Rerror response, rather
than discarding the message entirely. As a result, the client is required
to assume that said Rerror response is a result of the original request, not
a result of the cancellation and thus passes the EINTR error back to user space.
This is not the worst thing it could do, however as discussed above, the correct
error code would have been ERESTARTSYS, such that user space programs with
SA_RESTART set get correctly restarted upon completion of the signal handler.
Instead, such programs get spurious EINTR results that they were not expecting
to handle.

It should be noted that there are plenty of user space programs that do not
set SA_RESTART and do not correctly handle EINTR either. However, that is then
a userspace bug. It should also be noted that this bug has been mitigated by
a recent commit to the Linux kernel [2], which essentially prevents the kernel
from sending Tflush requests unless the process is about to die (in which case
the process likely doesn't care about the response). Nevertheless, for older
kernels and to comply with the spec, I believe this change is beneficial.

 # Implementation

The fix is fairly simple, just skipping notification of a reply if
the pdu was previously cancelled. We do however, also notify the transport
layer that we're doing this, so it can clean up any resources it may be
holding. I also added a new trace event to distinguish
operations that caused an error reply from those that were cancelled.

One complication is that we only omit sending the message on EINTR errors in
order to avoid confusing the rest of the code (which may assume that a
client knows about a fid if it sucessfully passed it off to pud_complete
without checking for cancellation status). This does mean that if the server
acts upon the cancellation flag, it always needs to set err to EINTR. I believe
this is true of the current code.

[1] https://9fans.github.io/plan9port/man/man9/flush.html
[2] https://github.com/torvalds/linux/commit/9523feac272ccad2ad8186ba4fcc89103754de52

Signed-off-by: Keno Fischer <keno@juliacomputing.com>
---

Changes from v1:
 - In reponse to review by Greg Kurz, add a new transport
   layer operation to discard the buffer. I also attempted
   an implementation for xen, but I have done no verification
   on that beyond making sure it compiles, since I don't know
   how to use xen. Please review closely.

 hw/9pfs/9p.c               | 18 ++++++++++++++++++
 hw/9pfs/9p.h               |  1 +
 hw/9pfs/trace-events       |  1 +
 hw/9pfs/virtio-9p-device.c | 14 ++++++++++++++
 hw/9pfs/xen-9p-backend.c   | 11 +++++++++++
 5 files changed, 45 insertions(+)

Comments

Greg Kurz Dec. 7, 2017, 5:04 p.m. | #1
On Mon, 4 Dec 2017 15:36:19 -0500
Keno Fischer <keno@juliacomputing.com> wrote:

>  # Background
> 
> I was investigating spurious non-deterministic EINTR returns from
> various 9p file system operations in a Linux guest served from the
> qemu 9p server.
> 
>  ## EINTR, ERESTARTSYS and the linux kernel
> 
> When a signal arrives that the Linux kernel needs to deliver to user-space
> while a given thread is blocked (in the 9p case waiting for a reply to its
> request in 9p_client_rpc -> wait_event_interruptible), it asks whatever
> driver is currently running to abort its current operation (in the 9p case
> causing the submission of a TFLUSH message) and return to user space.
> In these situations, the error message reported is generally ERESTARTSYS.
> If the userspace processes specified SA_RESTART, this means that the
> system call will get restarted upon completion of the signal handler
> delivery (assuming the signal handler doesn't modify the process state
> in complicated ways not relevant here). If SA_RESTART is not specified,
> ERESTARTSYS gets translated to EINTR and user space is expected to handle
> the restart itself.
> 
>  ## The 9p TFLISH command
               ^^^
I'll fix this before pushing to my tree in case there's no v3.

> 
> The 9p TFLUSH commands requests that the server abort an ongoing operation.
> The man page [1] specifies:
> 
> ```
> If it recognizes oldtag as the tag of a pending transaction, it should abort any
> pending response and discard that tag.
> [...]
> When the client sends a Tflush, it must wait to receive the corresponding Rflush
> before reusing oldtag for subsequent messages. If a response to the flushed request
> is received before the Rflush, the client must honor the response as if it had not
> been flushed, since the completed request may signify a state change in the server
> ```
> 
> In particular, this means that the server must not send a reply with the orignal
> tag in response to the cancellation request, because the client is obligated
> to interpret such a reply as a coincidental reply to the original request.
> 
>  # The bug
> 
> When qemu receives a TFlush request, it sets the `cancelled` flag on the relevant
> pdu. This flag is periodically checked, e.g. in `v9fs_co_name_to_path`, and if
> set, the operation is aborted and the error is set to EINTR. However, the server
> then violates the spec, by returning to the client an Rerror response, rather
> than discarding the message entirely. As a result, the client is required
> to assume that said Rerror response is a result of the original request, not
> a result of the cancellation and thus passes the EINTR error back to user space.
> This is not the worst thing it could do, however as discussed above, the correct
> error code would have been ERESTARTSYS, such that user space programs with
> SA_RESTART set get correctly restarted upon completion of the signal handler.
> Instead, such programs get spurious EINTR results that they were not expecting
> to handle.
> 
> It should be noted that there are plenty of user space programs that do not
> set SA_RESTART and do not correctly handle EINTR either. However, that is then
> a userspace bug. It should also be noted that this bug has been mitigated by
> a recent commit to the Linux kernel [2], which essentially prevents the kernel
> from sending Tflush requests unless the process is about to die (in which case
> the process likely doesn't care about the response). Nevertheless, for older
> kernels and to comply with the spec, I believe this change is beneficial.
> 
>  # Implementation
> 
> The fix is fairly simple, just skipping notification of a reply if
> the pdu was previously cancelled. We do however, also notify the transport
> layer that we're doing this, so it can clean up any resources it may be
> holding. I also added a new trace event to distinguish
> operations that caused an error reply from those that were cancelled.
> 
> One complication is that we only omit sending the message on EINTR errors in
> order to avoid confusing the rest of the code (which may assume that a
> client knows about a fid if it sucessfully passed it off to pud_complete
> without checking for cancellation status). This does mean that if the server
> acts upon the cancellation flag, it always needs to set err to EINTR. I believe
> this is true of the current code.
> 
> [1] https://9fans.github.io/plan9port/man/man9/flush.html
> [2] https://github.com/torvalds/linux/commit/9523feac272ccad2ad8186ba4fcc89103754de52
> 
> Signed-off-by: Keno Fischer <keno@juliacomputing.com>
> ---
> 

It looks good. I could come up with a new qtest for TFLUSH and validate it
behaves as expected. I'll post that soon.

For the 9p and virtio parts.

Reviewed-by: Greg Kurz <groug@kaod.org>

Now I'd need an ack from Stefano for the xen bits before I push it to my tree.

Thanks,

--
Greg

> Changes from v1:
>  - In reponse to review by Greg Kurz, add a new transport
>    layer operation to discard the buffer. I also attempted
>    an implementation for xen, but I have done no verification
>    on that beyond making sure it compiles, since I don't know
>    how to use xen. Please review closely.
> 
>  hw/9pfs/9p.c               | 18 ++++++++++++++++++
>  hw/9pfs/9p.h               |  1 +
>  hw/9pfs/trace-events       |  1 +
>  hw/9pfs/virtio-9p-device.c | 14 ++++++++++++++
>  hw/9pfs/xen-9p-backend.c   | 11 +++++++++++
>  5 files changed, 45 insertions(+)
> 
> diff --git a/hw/9pfs/9p.c b/hw/9pfs/9p.c
> index 710cd91..daa8519 100644
> --- a/hw/9pfs/9p.c
> +++ b/hw/9pfs/9p.c
> @@ -648,6 +648,23 @@ static void coroutine_fn pdu_complete(V9fsPDU *pdu, ssize_t len)
>      V9fsState *s = pdu->s;
>      int ret;
>  
> +    /*
> +     * The 9p spec requires that successfully cancelled pdus receive no reply.
> +     * Sending a reply would confuse clients because they would
> +     * assume that any EINTR is the actual result of the operation,
> +     * rather than a consequence of the cancellation. However, if
> +     * the operation completed (succesfully or with an error other
> +     * than caused be cancellation), we do send out that reply, both
> +     * for efficiency and to avoid confusing the rest of the state machine
> +     * that assumes passing a non-error here will mean a successful
> +     * transmission of the reply.
> +     */
> +    if (pdu->cancelled && len == -EINTR) {
> +        trace_v9fs_rcancel(pdu->tag, pdu->id);
> +        pdu->s->transport->discard(pdu);
> +        goto out_wakeup;
> +    }
> +
>      if (len < 0) {
>          int err = -len;
>          len = 7;
> @@ -690,6 +707,7 @@ static void coroutine_fn pdu_complete(V9fsPDU *pdu, ssize_t len)
>  out_notify:
>      pdu->s->transport->push_and_notify(pdu);
>  
> +out_wakeup:
>      /* Now wakeup anybody waiting in flush for this request */
>      if (!qemu_co_queue_next(&pdu->complete)) {
>          pdu_free(pdu);
> diff --git a/hw/9pfs/9p.h b/hw/9pfs/9p.h
> index d1cfeaf..3c1b0b5 100644
> --- a/hw/9pfs/9p.h
> +++ b/hw/9pfs/9p.h
> @@ -365,6 +365,7 @@ struct V9fsTransport {
>      void        (*init_out_iov_from_pdu)(V9fsPDU *pdu, struct iovec **piov,
>                                           unsigned int *pniov, size_t size);
>      void        (*push_and_notify)(V9fsPDU *pdu);
> +    void        (*discard)(V9fsPDU *pdu);
>  };
>  
>  static inline int v9fs_register_transport(V9fsState *s,
> diff --git a/hw/9pfs/trace-events b/hw/9pfs/trace-events
> index 08a4abf..1aee350 100644
> --- a/hw/9pfs/trace-events
> +++ b/hw/9pfs/trace-events
> @@ -1,6 +1,7 @@
>  # See docs/devel/tracing.txt for syntax documentation.
>  
>  # hw/9pfs/virtio-9p.c
> +v9fs_rcancel(uint16_t tag, uint8_t id) "tag %d id %d"
>  v9fs_rerror(uint16_t tag, uint8_t id, int err) "tag %d id %d err %d"
>  v9fs_version(uint16_t tag, uint8_t id, int32_t msize, char* version) "tag %d id %d msize %d version %s"
>  v9fs_version_return(uint16_t tag, uint8_t id, int32_t msize, char* version) "tag %d id %d msize %d version %s"
> diff --git a/hw/9pfs/virtio-9p-device.c b/hw/9pfs/virtio-9p-device.c
> index 62650b0..2510329 100644
> --- a/hw/9pfs/virtio-9p-device.c
> +++ b/hw/9pfs/virtio-9p-device.c
> @@ -37,6 +37,19 @@ static void virtio_9p_push_and_notify(V9fsPDU *pdu)
>      virtio_notify(VIRTIO_DEVICE(v), v->vq);
>  }
>  
> +
> +static void virtio_pdu_discard(V9fsPDU *pdu)
> +{
> +    V9fsState *s = pdu->s;
> +    V9fsVirtioState *v = container_of(s, V9fsVirtioState, state);
> +    VirtQueueElement *elem = v->elems[pdu->idx];
> +
> +    /* discard element from the queue */
> +    virtqueue_detach_element(v->vq, elem, pdu->size);
> +    g_free(elem);
> +    v->elems[pdu->idx] = NULL;
> +}
> +
>  static void handle_9p_output(VirtIODevice *vdev, VirtQueue *vq)
>  {
>      V9fsVirtioState *v = (V9fsVirtioState *)vdev;
> @@ -221,6 +234,7 @@ static const struct V9fsTransport virtio_9p_transport = {
>      .init_in_iov_from_pdu = virtio_init_in_iov_from_pdu,
>      .init_out_iov_from_pdu = virtio_init_out_iov_from_pdu,
>      .push_and_notify = virtio_9p_push_and_notify,
> +    .discard = virtio_pdu_discard,
>  };
>  
>  /* virtio-9p device */
> diff --git a/hw/9pfs/xen-9p-backend.c b/hw/9pfs/xen-9p-backend.c
> index ee87f08..7208ce6 100644
> --- a/hw/9pfs/xen-9p-backend.c
> +++ b/hw/9pfs/xen-9p-backend.c
> @@ -233,12 +233,23 @@ static void xen_9pfs_push_and_notify(V9fsPDU *pdu)
>      qemu_bh_schedule(ring->bh);
>  }
>  
> +static void xen_9pfs_discard(V9fsPDU *pdu)
> +{
> +    Xen9pfsDev *priv = container_of(pdu->s, Xen9pfsDev, state);
> +    Xen9pfsRing *ring = &priv->rings[pdu->tag % priv->num_rings];
> +
> +    g_free(ring->sg);
> +    ring->sg = NULL;
> +    ring->inprogress = false;
> +}
> +
>  static const struct V9fsTransport xen_9p_transport = {
>      .pdu_vmarshal = xen_9pfs_pdu_vmarshal,
>      .pdu_vunmarshal = xen_9pfs_pdu_vunmarshal,
>      .init_in_iov_from_pdu = xen_9pfs_init_in_iov_from_pdu,
>      .init_out_iov_from_pdu = xen_9pfs_init_out_iov_from_pdu,
>      .push_and_notify = xen_9pfs_push_and_notify,
> +    .discard = xen_9pfs_discard,
>  };
>  
>  static int xen_9pfs_init(struct XenDevice *xendev)
Greg Kurz Dec. 8, 2017, 6:57 a.m. | #2
Cc'ing Stefano using a more appropriate address :)

On Thu, 7 Dec 2017 18:04:24 +0100
Greg Kurz <groug@kaod.org> wrote:

> On Mon, 4 Dec 2017 15:36:19 -0500
> Keno Fischer <keno@juliacomputing.com> wrote:
> 
> >  # Background
> > 
> > I was investigating spurious non-deterministic EINTR returns from
> > various 9p file system operations in a Linux guest served from the
> > qemu 9p server.
> > 
> >  ## EINTR, ERESTARTSYS and the linux kernel
> > 
> > When a signal arrives that the Linux kernel needs to deliver to user-space
> > while a given thread is blocked (in the 9p case waiting for a reply to its
> > request in 9p_client_rpc -> wait_event_interruptible), it asks whatever
> > driver is currently running to abort its current operation (in the 9p case
> > causing the submission of a TFLUSH message) and return to user space.
> > In these situations, the error message reported is generally ERESTARTSYS.
> > If the userspace processes specified SA_RESTART, this means that the
> > system call will get restarted upon completion of the signal handler
> > delivery (assuming the signal handler doesn't modify the process state
> > in complicated ways not relevant here). If SA_RESTART is not specified,
> > ERESTARTSYS gets translated to EINTR and user space is expected to handle
> > the restart itself.
> > 
> >  ## The 9p TFLISH command  
>                ^^^
> I'll fix this before pushing to my tree in case there's no v3.
> 
> > 
> > The 9p TFLUSH commands requests that the server abort an ongoing operation.
> > The man page [1] specifies:
> > 
> > ```
> > If it recognizes oldtag as the tag of a pending transaction, it should abort any
> > pending response and discard that tag.
> > [...]
> > When the client sends a Tflush, it must wait to receive the corresponding Rflush
> > before reusing oldtag for subsequent messages. If a response to the flushed request
> > is received before the Rflush, the client must honor the response as if it had not
> > been flushed, since the completed request may signify a state change in the server
> > ```
> > 
> > In particular, this means that the server must not send a reply with the orignal
> > tag in response to the cancellation request, because the client is obligated
> > to interpret such a reply as a coincidental reply to the original request.
> > 
> >  # The bug
> > 
> > When qemu receives a TFlush request, it sets the `cancelled` flag on the relevant
> > pdu. This flag is periodically checked, e.g. in `v9fs_co_name_to_path`, and if
> > set, the operation is aborted and the error is set to EINTR. However, the server
> > then violates the spec, by returning to the client an Rerror response, rather
> > than discarding the message entirely. As a result, the client is required
> > to assume that said Rerror response is a result of the original request, not
> > a result of the cancellation and thus passes the EINTR error back to user space.
> > This is not the worst thing it could do, however as discussed above, the correct
> > error code would have been ERESTARTSYS, such that user space programs with
> > SA_RESTART set get correctly restarted upon completion of the signal handler.
> > Instead, such programs get spurious EINTR results that they were not expecting
> > to handle.
> > 
> > It should be noted that there are plenty of user space programs that do not
> > set SA_RESTART and do not correctly handle EINTR either. However, that is then
> > a userspace bug. It should also be noted that this bug has been mitigated by
> > a recent commit to the Linux kernel [2], which essentially prevents the kernel
> > from sending Tflush requests unless the process is about to die (in which case
> > the process likely doesn't care about the response). Nevertheless, for older
> > kernels and to comply with the spec, I believe this change is beneficial.
> > 
> >  # Implementation
> > 
> > The fix is fairly simple, just skipping notification of a reply if
> > the pdu was previously cancelled. We do however, also notify the transport
> > layer that we're doing this, so it can clean up any resources it may be
> > holding. I also added a new trace event to distinguish
> > operations that caused an error reply from those that were cancelled.
> > 
> > One complication is that we only omit sending the message on EINTR errors in
> > order to avoid confusing the rest of the code (which may assume that a
> > client knows about a fid if it sucessfully passed it off to pud_complete
> > without checking for cancellation status). This does mean that if the server
> > acts upon the cancellation flag, it always needs to set err to EINTR. I believe
> > this is true of the current code.
> > 
> > [1] https://9fans.github.io/plan9port/man/man9/flush.html
> > [2] https://github.com/torvalds/linux/commit/9523feac272ccad2ad8186ba4fcc89103754de52
> > 
> > Signed-off-by: Keno Fischer <keno@juliacomputing.com>
> > ---
> >   
> 
> It looks good. I could come up with a new qtest for TFLUSH and validate it
> behaves as expected. I'll post that soon.
> 
> For the 9p and virtio parts.
> 
> Reviewed-by: Greg Kurz <groug@kaod.org>
> 
> Now I'd need an ack from Stefano for the xen bits before I push it to my tree.
> 
> Thanks,
> 
> --
> Greg
> 
> > Changes from v1:
> >  - In reponse to review by Greg Kurz, add a new transport
> >    layer operation to discard the buffer. I also attempted
> >    an implementation for xen, but I have done no verification
> >    on that beyond making sure it compiles, since I don't know
> >    how to use xen. Please review closely.
> > 
> >  hw/9pfs/9p.c               | 18 ++++++++++++++++++
> >  hw/9pfs/9p.h               |  1 +
> >  hw/9pfs/trace-events       |  1 +
> >  hw/9pfs/virtio-9p-device.c | 14 ++++++++++++++
> >  hw/9pfs/xen-9p-backend.c   | 11 +++++++++++
> >  5 files changed, 45 insertions(+)
> > 
> > diff --git a/hw/9pfs/9p.c b/hw/9pfs/9p.c
> > index 710cd91..daa8519 100644
> > --- a/hw/9pfs/9p.c
> > +++ b/hw/9pfs/9p.c
> > @@ -648,6 +648,23 @@ static void coroutine_fn pdu_complete(V9fsPDU *pdu, ssize_t len)
> >      V9fsState *s = pdu->s;
> >      int ret;
> >  
> > +    /*
> > +     * The 9p spec requires that successfully cancelled pdus receive no reply.
> > +     * Sending a reply would confuse clients because they would
> > +     * assume that any EINTR is the actual result of the operation,
> > +     * rather than a consequence of the cancellation. However, if
> > +     * the operation completed (succesfully or with an error other
> > +     * than caused be cancellation), we do send out that reply, both
> > +     * for efficiency and to avoid confusing the rest of the state machine
> > +     * that assumes passing a non-error here will mean a successful
> > +     * transmission of the reply.
> > +     */
> > +    if (pdu->cancelled && len == -EINTR) {
> > +        trace_v9fs_rcancel(pdu->tag, pdu->id);
> > +        pdu->s->transport->discard(pdu);
> > +        goto out_wakeup;
> > +    }
> > +
> >      if (len < 0) {
> >          int err = -len;
> >          len = 7;
> > @@ -690,6 +707,7 @@ static void coroutine_fn pdu_complete(V9fsPDU *pdu, ssize_t len)
> >  out_notify:
> >      pdu->s->transport->push_and_notify(pdu);
> >  
> > +out_wakeup:
> >      /* Now wakeup anybody waiting in flush for this request */
> >      if (!qemu_co_queue_next(&pdu->complete)) {
> >          pdu_free(pdu);
> > diff --git a/hw/9pfs/9p.h b/hw/9pfs/9p.h
> > index d1cfeaf..3c1b0b5 100644
> > --- a/hw/9pfs/9p.h
> > +++ b/hw/9pfs/9p.h
> > @@ -365,6 +365,7 @@ struct V9fsTransport {
> >      void        (*init_out_iov_from_pdu)(V9fsPDU *pdu, struct iovec **piov,
> >                                           unsigned int *pniov, size_t size);
> >      void        (*push_and_notify)(V9fsPDU *pdu);
> > +    void        (*discard)(V9fsPDU *pdu);
> >  };
> >  
> >  static inline int v9fs_register_transport(V9fsState *s,
> > diff --git a/hw/9pfs/trace-events b/hw/9pfs/trace-events
> > index 08a4abf..1aee350 100644
> > --- a/hw/9pfs/trace-events
> > +++ b/hw/9pfs/trace-events
> > @@ -1,6 +1,7 @@
> >  # See docs/devel/tracing.txt for syntax documentation.
> >  
> >  # hw/9pfs/virtio-9p.c
> > +v9fs_rcancel(uint16_t tag, uint8_t id) "tag %d id %d"
> >  v9fs_rerror(uint16_t tag, uint8_t id, int err) "tag %d id %d err %d"
> >  v9fs_version(uint16_t tag, uint8_t id, int32_t msize, char* version) "tag %d id %d msize %d version %s"
> >  v9fs_version_return(uint16_t tag, uint8_t id, int32_t msize, char* version) "tag %d id %d msize %d version %s"
> > diff --git a/hw/9pfs/virtio-9p-device.c b/hw/9pfs/virtio-9p-device.c
> > index 62650b0..2510329 100644
> > --- a/hw/9pfs/virtio-9p-device.c
> > +++ b/hw/9pfs/virtio-9p-device.c
> > @@ -37,6 +37,19 @@ static void virtio_9p_push_and_notify(V9fsPDU *pdu)
> >      virtio_notify(VIRTIO_DEVICE(v), v->vq);
> >  }
> >  
> > +
> > +static void virtio_pdu_discard(V9fsPDU *pdu)
> > +{
> > +    V9fsState *s = pdu->s;
> > +    V9fsVirtioState *v = container_of(s, V9fsVirtioState, state);
> > +    VirtQueueElement *elem = v->elems[pdu->idx];
> > +
> > +    /* discard element from the queue */
> > +    virtqueue_detach_element(v->vq, elem, pdu->size);
> > +    g_free(elem);
> > +    v->elems[pdu->idx] = NULL;
> > +}
> > +
> >  static void handle_9p_output(VirtIODevice *vdev, VirtQueue *vq)
> >  {
> >      V9fsVirtioState *v = (V9fsVirtioState *)vdev;
> > @@ -221,6 +234,7 @@ static const struct V9fsTransport virtio_9p_transport = {
> >      .init_in_iov_from_pdu = virtio_init_in_iov_from_pdu,
> >      .init_out_iov_from_pdu = virtio_init_out_iov_from_pdu,
> >      .push_and_notify = virtio_9p_push_and_notify,
> > +    .discard = virtio_pdu_discard,
> >  };
> >  
> >  /* virtio-9p device */
> > diff --git a/hw/9pfs/xen-9p-backend.c b/hw/9pfs/xen-9p-backend.c
> > index ee87f08..7208ce6 100644
> > --- a/hw/9pfs/xen-9p-backend.c
> > +++ b/hw/9pfs/xen-9p-backend.c
> > @@ -233,12 +233,23 @@ static void xen_9pfs_push_and_notify(V9fsPDU *pdu)
> >      qemu_bh_schedule(ring->bh);
> >  }
> >  
> > +static void xen_9pfs_discard(V9fsPDU *pdu)
> > +{
> > +    Xen9pfsDev *priv = container_of(pdu->s, Xen9pfsDev, state);
> > +    Xen9pfsRing *ring = &priv->rings[pdu->tag % priv->num_rings];
> > +
> > +    g_free(ring->sg);
> > +    ring->sg = NULL;
> > +    ring->inprogress = false;
> > +}
> > +
> >  static const struct V9fsTransport xen_9p_transport = {
> >      .pdu_vmarshal = xen_9pfs_pdu_vmarshal,
> >      .pdu_vunmarshal = xen_9pfs_pdu_vunmarshal,
> >      .init_in_iov_from_pdu = xen_9pfs_init_in_iov_from_pdu,
> >      .init_out_iov_from_pdu = xen_9pfs_init_out_iov_from_pdu,
> >      .push_and_notify = xen_9pfs_push_and_notify,
> > +    .discard = xen_9pfs_discard,
> >  };
> >  
> >  static int xen_9pfs_init(struct XenDevice *xendev)  
> 
>
Stefano Stabellini Dec. 12, 2017, 11:14 p.m. | #3
On Fri, 8 Dec 2017, Greg Kurz wrote:
> Cc'ing Stefano using a more appropriate address :)

Thanks Greg for the ping, one comment inline below.


> On Thu, 7 Dec 2017 18:04:24 +0100
> Greg Kurz <groug@kaod.org> wrote:
> 
> > On Mon, 4 Dec 2017 15:36:19 -0500
> > Keno Fischer <keno@juliacomputing.com> wrote:
> > 
> > >  # Background
> > > 
> > > I was investigating spurious non-deterministic EINTR returns from
> > > various 9p file system operations in a Linux guest served from the
> > > qemu 9p server.
> > > 
> > >  ## EINTR, ERESTARTSYS and the linux kernel
> > > 
> > > When a signal arrives that the Linux kernel needs to deliver to user-space
> > > while a given thread is blocked (in the 9p case waiting for a reply to its
> > > request in 9p_client_rpc -> wait_event_interruptible), it asks whatever
> > > driver is currently running to abort its current operation (in the 9p case
> > > causing the submission of a TFLUSH message) and return to user space.
> > > In these situations, the error message reported is generally ERESTARTSYS.
> > > If the userspace processes specified SA_RESTART, this means that the
> > > system call will get restarted upon completion of the signal handler
> > > delivery (assuming the signal handler doesn't modify the process state
> > > in complicated ways not relevant here). If SA_RESTART is not specified,
> > > ERESTARTSYS gets translated to EINTR and user space is expected to handle
> > > the restart itself.
> > > 
> > >  ## The 9p TFLISH command  
> >                ^^^
> > I'll fix this before pushing to my tree in case there's no v3.
> > 
> > > 
> > > The 9p TFLUSH commands requests that the server abort an ongoing operation.
> > > The man page [1] specifies:
> > > 
> > > ```
> > > If it recognizes oldtag as the tag of a pending transaction, it should abort any
> > > pending response and discard that tag.
> > > [...]
> > > When the client sends a Tflush, it must wait to receive the corresponding Rflush
> > > before reusing oldtag for subsequent messages. If a response to the flushed request
> > > is received before the Rflush, the client must honor the response as if it had not
> > > been flushed, since the completed request may signify a state change in the server
> > > ```
> > > 
> > > In particular, this means that the server must not send a reply with the orignal
> > > tag in response to the cancellation request, because the client is obligated
> > > to interpret such a reply as a coincidental reply to the original request.
> > > 
> > >  # The bug
> > > 
> > > When qemu receives a TFlush request, it sets the `cancelled` flag on the relevant
> > > pdu. This flag is periodically checked, e.g. in `v9fs_co_name_to_path`, and if
> > > set, the operation is aborted and the error is set to EINTR. However, the server
> > > then violates the spec, by returning to the client an Rerror response, rather
> > > than discarding the message entirely. As a result, the client is required
> > > to assume that said Rerror response is a result of the original request, not
> > > a result of the cancellation and thus passes the EINTR error back to user space.
> > > This is not the worst thing it could do, however as discussed above, the correct
> > > error code would have been ERESTARTSYS, such that user space programs with
> > > SA_RESTART set get correctly restarted upon completion of the signal handler.
> > > Instead, such programs get spurious EINTR results that they were not expecting
> > > to handle.
> > > 
> > > It should be noted that there are plenty of user space programs that do not
> > > set SA_RESTART and do not correctly handle EINTR either. However, that is then
> > > a userspace bug. It should also be noted that this bug has been mitigated by
> > > a recent commit to the Linux kernel [2], which essentially prevents the kernel
> > > from sending Tflush requests unless the process is about to die (in which case
> > > the process likely doesn't care about the response). Nevertheless, for older
> > > kernels and to comply with the spec, I believe this change is beneficial.
> > > 
> > >  # Implementation
> > > 
> > > The fix is fairly simple, just skipping notification of a reply if
> > > the pdu was previously cancelled. We do however, also notify the transport
> > > layer that we're doing this, so it can clean up any resources it may be
> > > holding. I also added a new trace event to distinguish
> > > operations that caused an error reply from those that were cancelled.
> > > 
> > > One complication is that we only omit sending the message on EINTR errors in
> > > order to avoid confusing the rest of the code (which may assume that a
> > > client knows about a fid if it sucessfully passed it off to pud_complete
> > > without checking for cancellation status). This does mean that if the server
> > > acts upon the cancellation flag, it always needs to set err to EINTR. I believe
> > > this is true of the current code.
> > > 
> > > [1] https://9fans.github.io/plan9port/man/man9/flush.html
> > > [2] https://github.com/torvalds/linux/commit/9523feac272ccad2ad8186ba4fcc89103754de52
> > > 
> > > Signed-off-by: Keno Fischer <keno@juliacomputing.com>
> > > ---
> > >   
> > 
> > It looks good. I could come up with a new qtest for TFLUSH and validate it
> > behaves as expected. I'll post that soon.
> > 
> > For the 9p and virtio parts.
> > 
> > Reviewed-by: Greg Kurz <groug@kaod.org>
> > 
> > Now I'd need an ack from Stefano for the xen bits before I push it to my tree.
> > 
> > Thanks,
> > 
> > --
> > Greg
> > 
> > > Changes from v1:
> > >  - In reponse to review by Greg Kurz, add a new transport
> > >    layer operation to discard the buffer. I also attempted
> > >    an implementation for xen, but I have done no verification
> > >    on that beyond making sure it compiles, since I don't know
> > >    how to use xen. Please review closely.
> > > 
> > >  hw/9pfs/9p.c               | 18 ++++++++++++++++++
> > >  hw/9pfs/9p.h               |  1 +
> > >  hw/9pfs/trace-events       |  1 +
> > >  hw/9pfs/virtio-9p-device.c | 14 ++++++++++++++
> > >  hw/9pfs/xen-9p-backend.c   | 11 +++++++++++
> > >  5 files changed, 45 insertions(+)
> > > 
> > > diff --git a/hw/9pfs/9p.c b/hw/9pfs/9p.c
> > > index 710cd91..daa8519 100644
> > > --- a/hw/9pfs/9p.c
> > > +++ b/hw/9pfs/9p.c
> > > @@ -648,6 +648,23 @@ static void coroutine_fn pdu_complete(V9fsPDU *pdu, ssize_t len)
> > >      V9fsState *s = pdu->s;
> > >      int ret;
> > >  
> > > +    /*
> > > +     * The 9p spec requires that successfully cancelled pdus receive no reply.
> > > +     * Sending a reply would confuse clients because they would
> > > +     * assume that any EINTR is the actual result of the operation,
> > > +     * rather than a consequence of the cancellation. However, if
> > > +     * the operation completed (succesfully or with an error other
> > > +     * than caused be cancellation), we do send out that reply, both
> > > +     * for efficiency and to avoid confusing the rest of the state machine
> > > +     * that assumes passing a non-error here will mean a successful
> > > +     * transmission of the reply.
> > > +     */
> > > +    if (pdu->cancelled && len == -EINTR) {
> > > +        trace_v9fs_rcancel(pdu->tag, pdu->id);
> > > +        pdu->s->transport->discard(pdu);
> > > +        goto out_wakeup;
> > > +    }
> > > +
> > >      if (len < 0) {
> > >          int err = -len;
> > >          len = 7;
> > > @@ -690,6 +707,7 @@ static void coroutine_fn pdu_complete(V9fsPDU *pdu, ssize_t len)
> > >  out_notify:
> > >      pdu->s->transport->push_and_notify(pdu);
> > >  
> > > +out_wakeup:
> > >      /* Now wakeup anybody waiting in flush for this request */
> > >      if (!qemu_co_queue_next(&pdu->complete)) {
> > >          pdu_free(pdu);
> > > diff --git a/hw/9pfs/9p.h b/hw/9pfs/9p.h
> > > index d1cfeaf..3c1b0b5 100644
> > > --- a/hw/9pfs/9p.h
> > > +++ b/hw/9pfs/9p.h
> > > @@ -365,6 +365,7 @@ struct V9fsTransport {
> > >      void        (*init_out_iov_from_pdu)(V9fsPDU *pdu, struct iovec **piov,
> > >                                           unsigned int *pniov, size_t size);
> > >      void        (*push_and_notify)(V9fsPDU *pdu);
> > > +    void        (*discard)(V9fsPDU *pdu);
> > >  };
> > >  
> > >  static inline int v9fs_register_transport(V9fsState *s,
> > > diff --git a/hw/9pfs/trace-events b/hw/9pfs/trace-events
> > > index 08a4abf..1aee350 100644
> > > --- a/hw/9pfs/trace-events
> > > +++ b/hw/9pfs/trace-events
> > > @@ -1,6 +1,7 @@
> > >  # See docs/devel/tracing.txt for syntax documentation.
> > >  
> > >  # hw/9pfs/virtio-9p.c
> > > +v9fs_rcancel(uint16_t tag, uint8_t id) "tag %d id %d"
> > >  v9fs_rerror(uint16_t tag, uint8_t id, int err) "tag %d id %d err %d"
> > >  v9fs_version(uint16_t tag, uint8_t id, int32_t msize, char* version) "tag %d id %d msize %d version %s"
> > >  v9fs_version_return(uint16_t tag, uint8_t id, int32_t msize, char* version) "tag %d id %d msize %d version %s"
> > > diff --git a/hw/9pfs/virtio-9p-device.c b/hw/9pfs/virtio-9p-device.c
> > > index 62650b0..2510329 100644
> > > --- a/hw/9pfs/virtio-9p-device.c
> > > +++ b/hw/9pfs/virtio-9p-device.c
> > > @@ -37,6 +37,19 @@ static void virtio_9p_push_and_notify(V9fsPDU *pdu)
> > >      virtio_notify(VIRTIO_DEVICE(v), v->vq);
> > >  }
> > >  
> > > +
> > > +static void virtio_pdu_discard(V9fsPDU *pdu)
> > > +{
> > > +    V9fsState *s = pdu->s;
> > > +    V9fsVirtioState *v = container_of(s, V9fsVirtioState, state);
> > > +    VirtQueueElement *elem = v->elems[pdu->idx];
> > > +
> > > +    /* discard element from the queue */
> > > +    virtqueue_detach_element(v->vq, elem, pdu->size);
> > > +    g_free(elem);
> > > +    v->elems[pdu->idx] = NULL;
> > > +}
> > > +
> > >  static void handle_9p_output(VirtIODevice *vdev, VirtQueue *vq)
> > >  {
> > >      V9fsVirtioState *v = (V9fsVirtioState *)vdev;
> > > @@ -221,6 +234,7 @@ static const struct V9fsTransport virtio_9p_transport = {
> > >      .init_in_iov_from_pdu = virtio_init_in_iov_from_pdu,
> > >      .init_out_iov_from_pdu = virtio_init_out_iov_from_pdu,
> > >      .push_and_notify = virtio_9p_push_and_notify,
> > > +    .discard = virtio_pdu_discard,
> > >  };
> > >  
> > >  /* virtio-9p device */
> > > diff --git a/hw/9pfs/xen-9p-backend.c b/hw/9pfs/xen-9p-backend.c
> > > index ee87f08..7208ce6 100644
> > > --- a/hw/9pfs/xen-9p-backend.c
> > > +++ b/hw/9pfs/xen-9p-backend.c
> > > @@ -233,12 +233,23 @@ static void xen_9pfs_push_and_notify(V9fsPDU *pdu)
> > >      qemu_bh_schedule(ring->bh);
> > >  }
> > >  
> > > +static void xen_9pfs_discard(V9fsPDU *pdu)
> > > +{
> > > +    Xen9pfsDev *priv = container_of(pdu->s, Xen9pfsDev, state);
> > > +    Xen9pfsRing *ring = &priv->rings[pdu->tag % priv->num_rings];
> > > +
> > > +    g_free(ring->sg);
> > > +    ring->sg = NULL;
> > > +    ring->inprogress = false;

I think we also need to:

    /* this is required to consume the request */
    ring->intf->out_cons = ring->out_cons;
    xen_wmb();

    /* this is required to complete the cancellation of the pdu operation */
    ring->inprogress = false;

    /* this is required to let the other end know that there is more
     * space on the ring available */
    xenevtchn_notify(ring->evtchndev, ring->local_port);

    /* trigger the read of the next request */
    qemu_bh_schedule(ring->bh);


Which is basically almost everything else from xen_9pfs_push_and_notify,
except the modification of in_prod, because we are not going to write
back anything to the ring.



> > > +}
> > > +
> > >  static const struct V9fsTransport xen_9p_transport = {
> > >      .pdu_vmarshal = xen_9pfs_pdu_vmarshal,
> > >      .pdu_vunmarshal = xen_9pfs_pdu_vunmarshal,
> > >      .init_in_iov_from_pdu = xen_9pfs_init_in_iov_from_pdu,
> > >      .init_out_iov_from_pdu = xen_9pfs_init_out_iov_from_pdu,
> > >      .push_and_notify = xen_9pfs_push_and_notify,
> > > +    .discard = xen_9pfs_discard,
> > >  };
> > >  
> > >  static int xen_9pfs_init(struct XenDevice *xendev)
Greg Kurz Dec. 13, 2017, 1:57 p.m. | #4
On Tue, 12 Dec 2017 15:14:06 -0800 (PST)
Stefano Stabellini <sstabellini@kernel.org> wrote:

> On Fri, 8 Dec 2017, Greg Kurz wrote:
> > Cc'ing Stefano using a more appropriate address :)  
> 
> Thanks Greg for the ping, one comment inline below.
> 
[...]
> 
> I think we also need to:
> 
>     /* this is required to consume the request */
>     ring->intf->out_cons = ring->out_cons;
>     xen_wmb();
> 
>     /* this is required to complete the cancellation of the pdu operation */
>     ring->inprogress = false;
> 
>     /* this is required to let the other end know that there is more
>      * space on the ring available */
>     xenevtchn_notify(ring->evtchndev, ring->local_port);
> 

That makes me think that we should also probably notify the guest with
virtio since this isn't a reset... :)

>     /* trigger the read of the next request */
>     qemu_bh_schedule(ring->bh);
> 
> 
> Which is basically almost everything else from xen_9pfs_push_and_notify,
> except the modification of in_prod, because we are not going to write
> back anything to the ring.
> 

... in which case, the same statement stands for virtio_9pfs_push_and_notify()

We should probably unify push_and_notify and discard into a single operation:

-    void        (*push_and_notify)(V9fsPDU *pdu);
+    void        (*return_to_guest)(V9fsPDU *pdu, bool discard);

Let's do this in a separate patch, and respin this fix on top.

> 
> 
> > > > +}
> > > > +
> > > >  static const struct V9fsTransport xen_9p_transport = {
> > > >      .pdu_vmarshal = xen_9pfs_pdu_vmarshal,
> > > >      .pdu_vunmarshal = xen_9pfs_pdu_vunmarshal,
> > > >      .init_in_iov_from_pdu = xen_9pfs_init_in_iov_from_pdu,
> > > >      .init_out_iov_from_pdu = xen_9pfs_init_out_iov_from_pdu,
> > > >      .push_and_notify = xen_9pfs_push_and_notify,
> > > > +    .discard = xen_9pfs_discard,
> > > >  };
> > > >  
> > > >  static int xen_9pfs_init(struct XenDevice *xendev)    
>

Patch

diff --git a/hw/9pfs/9p.c b/hw/9pfs/9p.c
index 710cd91..daa8519 100644
--- a/hw/9pfs/9p.c
+++ b/hw/9pfs/9p.c
@@ -648,6 +648,23 @@  static void coroutine_fn pdu_complete(V9fsPDU *pdu, ssize_t len)
     V9fsState *s = pdu->s;
     int ret;
 
+    /*
+     * The 9p spec requires that successfully cancelled pdus receive no reply.
+     * Sending a reply would confuse clients because they would
+     * assume that any EINTR is the actual result of the operation,
+     * rather than a consequence of the cancellation. However, if
+     * the operation completed (succesfully or with an error other
+     * than caused be cancellation), we do send out that reply, both
+     * for efficiency and to avoid confusing the rest of the state machine
+     * that assumes passing a non-error here will mean a successful
+     * transmission of the reply.
+     */
+    if (pdu->cancelled && len == -EINTR) {
+        trace_v9fs_rcancel(pdu->tag, pdu->id);
+        pdu->s->transport->discard(pdu);
+        goto out_wakeup;
+    }
+
     if (len < 0) {
         int err = -len;
         len = 7;
@@ -690,6 +707,7 @@  static void coroutine_fn pdu_complete(V9fsPDU *pdu, ssize_t len)
 out_notify:
     pdu->s->transport->push_and_notify(pdu);
 
+out_wakeup:
     /* Now wakeup anybody waiting in flush for this request */
     if (!qemu_co_queue_next(&pdu->complete)) {
         pdu_free(pdu);
diff --git a/hw/9pfs/9p.h b/hw/9pfs/9p.h
index d1cfeaf..3c1b0b5 100644
--- a/hw/9pfs/9p.h
+++ b/hw/9pfs/9p.h
@@ -365,6 +365,7 @@  struct V9fsTransport {
     void        (*init_out_iov_from_pdu)(V9fsPDU *pdu, struct iovec **piov,
                                          unsigned int *pniov, size_t size);
     void        (*push_and_notify)(V9fsPDU *pdu);
+    void        (*discard)(V9fsPDU *pdu);
 };
 
 static inline int v9fs_register_transport(V9fsState *s,
diff --git a/hw/9pfs/trace-events b/hw/9pfs/trace-events
index 08a4abf..1aee350 100644
--- a/hw/9pfs/trace-events
+++ b/hw/9pfs/trace-events
@@ -1,6 +1,7 @@ 
 # See docs/devel/tracing.txt for syntax documentation.
 
 # hw/9pfs/virtio-9p.c
+v9fs_rcancel(uint16_t tag, uint8_t id) "tag %d id %d"
 v9fs_rerror(uint16_t tag, uint8_t id, int err) "tag %d id %d err %d"
 v9fs_version(uint16_t tag, uint8_t id, int32_t msize, char* version) "tag %d id %d msize %d version %s"
 v9fs_version_return(uint16_t tag, uint8_t id, int32_t msize, char* version) "tag %d id %d msize %d version %s"
diff --git a/hw/9pfs/virtio-9p-device.c b/hw/9pfs/virtio-9p-device.c
index 62650b0..2510329 100644
--- a/hw/9pfs/virtio-9p-device.c
+++ b/hw/9pfs/virtio-9p-device.c
@@ -37,6 +37,19 @@  static void virtio_9p_push_and_notify(V9fsPDU *pdu)
     virtio_notify(VIRTIO_DEVICE(v), v->vq);
 }
 
+
+static void virtio_pdu_discard(V9fsPDU *pdu)
+{
+    V9fsState *s = pdu->s;
+    V9fsVirtioState *v = container_of(s, V9fsVirtioState, state);
+    VirtQueueElement *elem = v->elems[pdu->idx];
+
+    /* discard element from the queue */
+    virtqueue_detach_element(v->vq, elem, pdu->size);
+    g_free(elem);
+    v->elems[pdu->idx] = NULL;
+}
+
 static void handle_9p_output(VirtIODevice *vdev, VirtQueue *vq)
 {
     V9fsVirtioState *v = (V9fsVirtioState *)vdev;
@@ -221,6 +234,7 @@  static const struct V9fsTransport virtio_9p_transport = {
     .init_in_iov_from_pdu = virtio_init_in_iov_from_pdu,
     .init_out_iov_from_pdu = virtio_init_out_iov_from_pdu,
     .push_and_notify = virtio_9p_push_and_notify,
+    .discard = virtio_pdu_discard,
 };
 
 /* virtio-9p device */
diff --git a/hw/9pfs/xen-9p-backend.c b/hw/9pfs/xen-9p-backend.c
index ee87f08..7208ce6 100644
--- a/hw/9pfs/xen-9p-backend.c
+++ b/hw/9pfs/xen-9p-backend.c
@@ -233,12 +233,23 @@  static void xen_9pfs_push_and_notify(V9fsPDU *pdu)
     qemu_bh_schedule(ring->bh);
 }
 
+static void xen_9pfs_discard(V9fsPDU *pdu)
+{
+    Xen9pfsDev *priv = container_of(pdu->s, Xen9pfsDev, state);
+    Xen9pfsRing *ring = &priv->rings[pdu->tag % priv->num_rings];
+
+    g_free(ring->sg);
+    ring->sg = NULL;
+    ring->inprogress = false;
+}
+
 static const struct V9fsTransport xen_9p_transport = {
     .pdu_vmarshal = xen_9pfs_pdu_vmarshal,
     .pdu_vunmarshal = xen_9pfs_pdu_vunmarshal,
     .init_in_iov_from_pdu = xen_9pfs_init_in_iov_from_pdu,
     .init_out_iov_from_pdu = xen_9pfs_init_out_iov_from_pdu,
     .push_and_notify = xen_9pfs_push_and_notify,
+    .discard = xen_9pfs_discard,
 };
 
 static int xen_9pfs_init(struct XenDevice *xendev)