diff mbox series

[10/15] python/machine: Add support for AQMP backend

Message ID 20210917054047.2042843-11-jsnow@redhat.com
State New
Headers show
Series Switch iotests to using Async QMP | expand

Commit Message

John Snow Sept. 17, 2021, 5:40 a.m. UTC
To use the AQMP backend, Machine just needs to be a little more diligent
about what happens when closing a QMP connection. The operation is no
longer a freebie in the async world.

Because async QMP continues to check for messages asynchronously, it's
almost certainly likely that the loop will have exited due to EOF after
issuing the last 'quit' command. That error will ultimately be bubbled
up when attempting to close the QMP connection. The manager class here
then is free to discard it if it was expected.

Signed-off-by: John Snow <jsnow@redhat.com>

---

Yes, I regret that this class has become quite a dumping ground for
complexity around the exit path. It's in need of a refactor to help
separate out the exception handling and cleanup mechanisms from the
VM-related stuff, but it's not a priority to do that just yet -- but
it's on the list.

---
 python/qemu/machine/machine.py | 51 ++++++++++++++++++++++++++++++----
 1 file changed, 46 insertions(+), 5 deletions(-)

Comments

Hanna Czenczek Sept. 17, 2021, 2:16 p.m. UTC | #1
On 17.09.21 07:40, John Snow wrote:
> To use the AQMP backend, Machine just needs to be a little more diligent
> about what happens when closing a QMP connection. The operation is no
> longer a freebie in the async world.
>
> Because async QMP continues to check for messages asynchronously, it's
> almost certainly likely that the loop will have exited due to EOF after
> issuing the last 'quit' command. That error will ultimately be bubbled
> up when attempting to close the QMP connection. The manager class here
> then is free to discard it if it was expected.
>
> Signed-off-by: John Snow <jsnow@redhat.com>
>
> ---
>
> Yes, I regret that this class has become quite a dumping ground for
> complexity around the exit path. It's in need of a refactor to help
> separate out the exception handling and cleanup mechanisms from the
> VM-related stuff, but it's not a priority to do that just yet -- but
> it's on the list.
>
> ---
>   python/qemu/machine/machine.py | 51 ++++++++++++++++++++++++++++++----
>   1 file changed, 46 insertions(+), 5 deletions(-)
>
> diff --git a/python/qemu/machine/machine.py b/python/qemu/machine/machine.py
> index 6e58d2f951..8f5a6649e5 100644
> --- a/python/qemu/machine/machine.py
> +++ b/python/qemu/machine/machine.py
> @@ -342,9 +342,15 @@ def _post_shutdown(self) -> None:
>           # Comprehensive reset for the failed launch case:
>           self._early_cleanup()
>   
> -        if self._qmp_connection:
> -            self._qmp.close()
> -            self._qmp_connection = None
> +        try:
> +            self._close_qmp_connection()
> +        except Exception as err:  # pylint: disable=broad-except
> +            LOG.warning(
> +                "Exception closing QMP connection: %s",
> +                str(err) if str(err) else type(err).__name__
> +            )
> +        finally:
> +            assert self._qmp_connection is None
>   
>           self._close_qemu_log_file()
>   
> @@ -420,6 +426,31 @@ def _launch(self) -> None:
>                                          close_fds=False)
>           self._post_launch()
>   
> +    def _close_qmp_connection(self) -> None:
> +        """
> +        Close the underlying QMP connection, if any.
> +
> +        Dutifully report errors that occurred while closing, but assume
> +        that any error encountered indicates an abnormal termination
> +        process and not a failure to close.
> +        """
> +        if self._qmp_connection is None:
> +            return
> +
> +        try:
> +            self._qmp.close()
> +        except EOFError:
> +            # EOF can occur as an Exception here when using the Async
> +            # QMP backend. It indicates that the server closed the
> +            # stream. If we successfully issued 'quit' at any point,
> +            # then this was expected. If the remote went away without
> +            # our permission, it's worth reporting that as an abnormal
> +            # shutdown case.
> +            if not self._has_quit:
> +                raise
> +        finally:
> +            self._qmp_connection = None
> +
>       def _early_cleanup(self) -> None:
>           """
>           Perform any cleanup that needs to happen before the VM exits.
> @@ -461,8 +492,18 @@ def _soft_shutdown(self, timeout: Optional[int]) -> None:
>   
>           if self._qmp_connection:
>               if not self._has_quit:
> -                # Might raise ConnectionReset
> -                self.qmp('quit')
> +                try:
> +                    # May raise ExecInterruptedError or StateError if the
> +                    # connection dies or has already died.
> +                    self.qmp('quit')
> +                finally:
> +                    # If 'quit' fails, we'll still want to call close(),
> +                    # which will raise any final errors that may have
> +                    # occurred while trying to send 'quit'.
> +                    self._close_qmp_connection()
> +            else:
> +                # Regardless, we want to tidy up the socket.
> +                self._close_qmp_connection()

Why can’t we wait for _post_shutdown to do that?  Has it something to do 
with this operation being “no longer a freeby” (I’m not quite sure what 
this refers to, execution time, or the set of possible exceptions, or 
perhaps something else entirely), and so we want to do it in the 
already-not-instantaneous _soft_shutdown?

Hanna

>   
>           # May raise subprocess.TimeoutExpired
>           self._subp.wait(timeout=timeout)
John Snow Sept. 17, 2021, 11:48 p.m. UTC | #2
On Fri, Sep 17, 2021 at 10:16 AM Hanna Reitz <hreitz@redhat.com> wrote:

> On 17.09.21 07:40, John Snow wrote:
> > To use the AQMP backend, Machine just needs to be a little more diligent
> > about what happens when closing a QMP connection. The operation is no
> > longer a freebie in the async world.
> >
> > Because async QMP continues to check for messages asynchronously, it's
> > almost certainly likely that the loop will have exited due to EOF after
> > issuing the last 'quit' command. That error will ultimately be bubbled
> > up when attempting to close the QMP connection. The manager class here
> > then is free to discard it if it was expected.
> >
> > Signed-off-by: John Snow <jsnow@redhat.com>
> >
> > ---
> >
> > Yes, I regret that this class has become quite a dumping ground for
> > complexity around the exit path. It's in need of a refactor to help
> > separate out the exception handling and cleanup mechanisms from the
> > VM-related stuff, but it's not a priority to do that just yet -- but
> > it's on the list.
> >
> > ---
> >   python/qemu/machine/machine.py | 51 ++++++++++++++++++++++++++++++----
> >   1 file changed, 46 insertions(+), 5 deletions(-)
> >
> > diff --git a/python/qemu/machine/machine.py
> b/python/qemu/machine/machine.py
> > index 6e58d2f951..8f5a6649e5 100644
> > --- a/python/qemu/machine/machine.py
> > +++ b/python/qemu/machine/machine.py
> > @@ -342,9 +342,15 @@ def _post_shutdown(self) -> None:
> >           # Comprehensive reset for the failed launch case:
> >           self._early_cleanup()
> >
> > -        if self._qmp_connection:
> > -            self._qmp.close()
> > -            self._qmp_connection = None
> > +        try:
> > +            self._close_qmp_connection()
> > +        except Exception as err:  # pylint: disable=broad-except
> > +            LOG.warning(
> > +                "Exception closing QMP connection: %s",
> > +                str(err) if str(err) else type(err).__name__
> > +            )
> > +        finally:
> > +            assert self._qmp_connection is None
> >
> >           self._close_qemu_log_file()
> >
> > @@ -420,6 +426,31 @@ def _launch(self) -> None:
> >                                          close_fds=False)
> >           self._post_launch()
> >
> > +    def _close_qmp_connection(self) -> None:
> > +        """
> > +        Close the underlying QMP connection, if any.
> > +
> > +        Dutifully report errors that occurred while closing, but assume
> > +        that any error encountered indicates an abnormal termination
> > +        process and not a failure to close.
> > +        """
> > +        if self._qmp_connection is None:
> > +            return
> > +
> > +        try:
> > +            self._qmp.close()
> > +        except EOFError:
> > +            # EOF can occur as an Exception here when using the Async
> > +            # QMP backend. It indicates that the server closed the
> > +            # stream. If we successfully issued 'quit' at any point,
> > +            # then this was expected. If the remote went away without
> > +            # our permission, it's worth reporting that as an abnormal
> > +            # shutdown case.
> > +            if not self._has_quit:
> > +                raise
> > +        finally:
> > +            self._qmp_connection = None
> > +
> >       def _early_cleanup(self) -> None:
> >           """
> >           Perform any cleanup that needs to happen before the VM exits.
> > @@ -461,8 +492,18 @@ def _soft_shutdown(self, timeout: Optional[int]) ->
> None:
> >
> >           if self._qmp_connection:
> >               if not self._has_quit:
> > -                # Might raise ConnectionReset
> > -                self.qmp('quit')
> > +                try:
> > +                    # May raise ExecInterruptedError or StateError if
> the
> > +                    # connection dies or has already died.
> > +                    self.qmp('quit')
> > +                finally:
> > +                    # If 'quit' fails, we'll still want to call close(),
> > +                    # which will raise any final errors that may have
> > +                    # occurred while trying to send 'quit'.
> > +                    self._close_qmp_connection()
> > +            else:
> > +                # Regardless, we want to tidy up the socket.
> > +                self._close_qmp_connection()
>
> Why can’t we wait for _post_shutdown to do that?  Has it something to do
> with this operation being “no longer a freeby” (I’m not quite sure what
> this refers to, execution time, or the set of possible exceptions, or
> perhaps something else entirely), and so we want to do it in the
> already-not-instantaneous _soft_shutdown?
>
> Hanna
>
>
I wrote that commit message too casually.

By "freebie", I meant that closing a simple sync socket does not have any
potential for raising an error, so we don't really have to worry about that
operation failing. The async machinery, by comparison, uses the
disconnection point as its synchronization point where it gathers the final
results from its various execution contexts (the reader, writer, and
disconnection tasks).

The likeliest error to see here would be something like EOFError for QEMU
hanging up before disconnect() was called. Other possible errors would be
stuff from deep in the internals of AQMP that very likely means there's a
bug I need to resolve -- so I was afraid of just wholesale silencing such
things. (Hence the logged warning in the _post_shutdown method. I genuinely
don't expect to see anything there, but I am paranoid and wanted to be
rock-solid sure that any problems are visible.) It is also possible that if
QEMU were to send a garbled and unprompted message after 'quit' succeeded
(For example, a malformed SHUTDOWN event) that this would also enqueue an
error to be shown here at this point.

One of the design points of AQMP is that calling QMPClient.disconnect() is
necessary to reset the client to a fully IDLE state to where it can be
re-used for a subsequent connection. It serves double-duty as both
disconnect *and* results gathering. So I wanted to make sure to call it
here while we still had the opportunity to report an "Abnormal Shutdown"
instead of potentially choking later during the post-shutdown and failing
to do resource cleanup.

Good: if shutdown() succeeds, you can rest well knowing that definitely
nothing weird happened.
Bad: There's a lot of complexity inherent in promising that.

Clear as mud?

--js
Hanna Czenczek Oct. 4, 2021, 9:43 a.m. UTC | #3
On 18.09.21 01:48, John Snow wrote:
>
>
> On Fri, Sep 17, 2021 at 10:16 AM Hanna Reitz <hreitz@redhat.com 
> <mailto:hreitz@redhat.com>> wrote:
>
>     On 17.09.21 07:40, John Snow wrote:
>     > To use the AQMP backend, Machine just needs to be a little more
>     diligent
>     > about what happens when closing a QMP connection. The operation
>     is no
>     > longer a freebie in the async world.
>     >
>     > Because async QMP continues to check for messages
>     asynchronously, it's
>     > almost certainly likely that the loop will have exited due to
>     EOF after
>     > issuing the last 'quit' command. That error will ultimately be
>     bubbled
>     > up when attempting to close the QMP connection. The manager
>     class here
>     > then is free to discard it if it was expected.
>     >
>     > Signed-off-by: John Snow <jsnow@redhat.com
>     <mailto:jsnow@redhat.com>>
>     >
>     > ---
>     >
>     > Yes, I regret that this class has become quite a dumping ground for
>     > complexity around the exit path. It's in need of a refactor to help
>     > separate out the exception handling and cleanup mechanisms from the
>     > VM-related stuff, but it's not a priority to do that just yet -- but
>     > it's on the list.
>     >
>     > ---
>     >   python/qemu/machine/machine.py | 51
>     ++++++++++++++++++++++++++++++----
>     >   1 file changed, 46 insertions(+), 5 deletions(-)
>     >
>     > diff --git a/python/qemu/machine/machine.py
>     b/python/qemu/machine/machine.py
>     > index 6e58d2f951..8f5a6649e5 100644
>     > --- a/python/qemu/machine/machine.py
>     > +++ b/python/qemu/machine/machine.py
>     > @@ -342,9 +342,15 @@ def _post_shutdown(self) -> None:
>     >           # Comprehensive reset for the failed launch case:
>     >           self._early_cleanup()
>     >
>     > -        if self._qmp_connection:
>     > -            self._qmp.close()
>     > -            self._qmp_connection = None
>     > +        try:
>     > +            self._close_qmp_connection()
>     > +        except Exception as err:  # pylint: disable=broad-except
>     > +            LOG.warning(
>     > +                "Exception closing QMP connection: %s",
>     > +                str(err) if str(err) else type(err).__name__
>     > +            )
>     > +        finally:
>     > +            assert self._qmp_connection is None
>     >
>     >           self._close_qemu_log_file()
>     >
>     > @@ -420,6 +426,31 @@ def _launch(self) -> None:
>     > close_fds=False)
>     >           self._post_launch()
>     >
>     > +    def _close_qmp_connection(self) -> None:
>     > +        """
>     > +        Close the underlying QMP connection, if any.
>     > +
>     > +        Dutifully report errors that occurred while closing,
>     but assume
>     > +        that any error encountered indicates an abnormal
>     termination
>     > +        process and not a failure to close.
>     > +        """
>     > +        if self._qmp_connection is None:
>     > +            return
>     > +
>     > +        try:
>     > +            self._qmp.close()
>     > +        except EOFError:
>     > +            # EOF can occur as an Exception here when using the
>     Async
>     > +            # QMP backend. It indicates that the server closed the
>     > +            # stream. If we successfully issued 'quit' at any
>     point,
>     > +            # then this was expected. If the remote went away
>     without
>     > +            # our permission, it's worth reporting that as an
>     abnormal
>     > +            # shutdown case.
>     > +            if not self._has_quit:
>     > +                raise
>     > +        finally:
>     > +            self._qmp_connection = None
>     > +
>     >       def _early_cleanup(self) -> None:
>     >           """
>     >           Perform any cleanup that needs to happen before the VM
>     exits.
>     > @@ -461,8 +492,18 @@ def _soft_shutdown(self, timeout:
>     Optional[int]) -> None:
>     >
>     >           if self._qmp_connection:
>     >               if not self._has_quit:
>     > -                # Might raise ConnectionReset
>     > -                self.qmp('quit')
>     > +                try:
>     > +                    # May raise ExecInterruptedError or
>     StateError if the
>     > +                    # connection dies or has already died.
>     > +                    self.qmp('quit')
>     > +                finally:
>     > +                    # If 'quit' fails, we'll still want to call
>     close(),
>     > +                    # which will raise any final errors that
>     may have
>     > +                    # occurred while trying to send 'quit'.
>     > +                    self._close_qmp_connection()
>     > +            else:
>     > +                # Regardless, we want to tidy up the socket.
>     > +                self._close_qmp_connection()
>
>     Why can’t we wait for _post_shutdown to do that?  Has it something
>     to do
>     with this operation being “no longer a freeby” (I’m not quite sure
>     what
>     this refers to, execution time, or the set of possible exceptions, or
>     perhaps something else entirely), and so we want to do it in the
>     already-not-instantaneous _soft_shutdown?
>
>     Hanna
>
>
> I wrote that commit message too casually.
>
> By "freebie", I meant that closing a simple sync socket does not have 
> any potential for raising an error, so we don't really have to worry 
> about that operation failing. The async machinery, by comparison, uses 
> the disconnection point as its synchronization point where it gathers 
> the final results from its various execution contexts (the reader, 
> writer, and disconnection tasks).
>
> The likeliest error to see here would be something like EOFError for 
> QEMU hanging up before disconnect() was called. Other possible errors 
> would be stuff from deep in the internals of AQMP that very likely 
> means there's a bug I need to resolve -- so I was afraid of just 
> wholesale silencing such things. (Hence the logged warning in the 
> _post_shutdown method. I genuinely don't expect to see anything there, 
> but I am paranoid and wanted to be rock-solid sure that any problems 
> are visible.) It is also possible that if QEMU were to send a garbled 
> and unprompted message after 'quit' succeeded (For example, a 
> malformed SHUTDOWN event) that this would also enqueue an error to be 
> shown here at this point.
>
> One of the design points of AQMP is that calling 
> QMPClient.disconnect() is necessary to reset the client to a fully 
> IDLE state to where it can be re-used for a subsequent connection. It 
> serves double-duty as both disconnect *and* results gathering. So I 
> wanted to make sure to call it here while we still had the opportunity 
> to report an "Abnormal Shutdown" instead of potentially choking later 
> during the post-shutdown and failing to do resource cleanup.
>
> Good: if shutdown() succeeds, you can rest well knowing that 
> definitely nothing weird happened.
> Bad: There's a lot of complexity inherent in promising that.
>
> Clear as mud?

Yep, that makes sense.  Thanks!

(Seems like you intend to send a v2, so I think I’ll refrain from giving 
a rather haphazard R-b now, because two weeks later I forgot whether I 
was originally confident enough to give one were it now for that bit 
here...)

Hanna
diff mbox series

Patch

diff --git a/python/qemu/machine/machine.py b/python/qemu/machine/machine.py
index 6e58d2f951..8f5a6649e5 100644
--- a/python/qemu/machine/machine.py
+++ b/python/qemu/machine/machine.py
@@ -342,9 +342,15 @@  def _post_shutdown(self) -> None:
         # Comprehensive reset for the failed launch case:
         self._early_cleanup()
 
-        if self._qmp_connection:
-            self._qmp.close()
-            self._qmp_connection = None
+        try:
+            self._close_qmp_connection()
+        except Exception as err:  # pylint: disable=broad-except
+            LOG.warning(
+                "Exception closing QMP connection: %s",
+                str(err) if str(err) else type(err).__name__
+            )
+        finally:
+            assert self._qmp_connection is None
 
         self._close_qemu_log_file()
 
@@ -420,6 +426,31 @@  def _launch(self) -> None:
                                        close_fds=False)
         self._post_launch()
 
+    def _close_qmp_connection(self) -> None:
+        """
+        Close the underlying QMP connection, if any.
+
+        Dutifully report errors that occurred while closing, but assume
+        that any error encountered indicates an abnormal termination
+        process and not a failure to close.
+        """
+        if self._qmp_connection is None:
+            return
+
+        try:
+            self._qmp.close()
+        except EOFError:
+            # EOF can occur as an Exception here when using the Async
+            # QMP backend. It indicates that the server closed the
+            # stream. If we successfully issued 'quit' at any point,
+            # then this was expected. If the remote went away without
+            # our permission, it's worth reporting that as an abnormal
+            # shutdown case.
+            if not self._has_quit:
+                raise
+        finally:
+            self._qmp_connection = None
+
     def _early_cleanup(self) -> None:
         """
         Perform any cleanup that needs to happen before the VM exits.
@@ -461,8 +492,18 @@  def _soft_shutdown(self, timeout: Optional[int]) -> None:
 
         if self._qmp_connection:
             if not self._has_quit:
-                # Might raise ConnectionReset
-                self.qmp('quit')
+                try:
+                    # May raise ExecInterruptedError or StateError if the
+                    # connection dies or has already died.
+                    self.qmp('quit')
+                finally:
+                    # If 'quit' fails, we'll still want to call close(),
+                    # which will raise any final errors that may have
+                    # occurred while trying to send 'quit'.
+                    self._close_qmp_connection()
+            else:
+                # Regardless, we want to tidy up the socket.
+                self._close_qmp_connection()
 
         # May raise subprocess.TimeoutExpired
         self._subp.wait(timeout=timeout)