diff mbox

[v3,1/3] qemu.py: fix is_running()

Message ID 20170720091902.22476-2-apahim@redhat.com
State New
Headers show

Commit Message

Amador Pahim July 20, 2017, 9:19 a.m. UTC
Current implementation is broken. It does not really test if the child
process is running.

The Popen.returncode will only be set after by a poll(), wait() or
communicate(). If the Popen fails to launch a VM, the Popen.returncode
will not turn to None by itself.

Instead of using Popen.returncode, let's use Popen.poll(), which
actually checks if child process has terminated.

Signed-off-by: Amador Pahim <apahim@redhat.com>
Reviewed-by: Eduardo Habkost <ehabkost@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
---
 scripts/qemu.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Markus Armbruster July 20, 2017, 11:49 a.m. UTC | #1
Amador Pahim <apahim@redhat.com> writes:

> Current implementation is broken. It does not really test if the child
> process is running.

What usage exactly is broken by this?  Got a reproducer for me?

> The Popen.returncode will only be set after by a poll(), wait() or
> communicate(). If the Popen fails to launch a VM, the Popen.returncode
> will not turn to None by itself.

Hmm.  What is the value of .returncode then?

> Instead of using Popen.returncode, let's use Popen.poll(), which
> actually checks if child process has terminated.
>
> Signed-off-by: Amador Pahim <apahim@redhat.com>
> Reviewed-by: Eduardo Habkost <ehabkost@redhat.com>
> Reviewed-by: Fam Zheng <famz@redhat.com>
> ---
>  scripts/qemu.py | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/scripts/qemu.py b/scripts/qemu.py
> index 880e3e8219..f0fade32bd 100644
> --- a/scripts/qemu.py
> +++ b/scripts/qemu.py
> @@ -86,7 +86,7 @@ class QEMUMachine(object):
>              raise
>  
>      def is_running(self):
> -        return self._popen and (self._popen.returncode is None)
> +        return self._popen and (self._popen.poll() is None)
>  
>      def exitcode(self):
>          if self._popen is None:
               return None
           return self._popen.returncode

Why is this one safe?
Amador Pahim July 20, 2017, 12:57 p.m. UTC | #2
On Thu, Jul 20, 2017 at 1:49 PM, Markus Armbruster <armbru@redhat.com> wrote:
> Amador Pahim <apahim@redhat.com> writes:
>
>> Current implementation is broken. It does not really test if the child
>> process is running.
>
> What usage exactly is broken by this?  Got a reproducer for me?

Problem is that 'returncode' is not set without a calling
poll()/wait()/communicate(), so it's only useful to test if the
process is running after such calls. But if we use 'poll()' instead,
it will, according to the docs, "Check if child process has
terminated. Set and return returncode attribute."

Reproducer is:

 >>> import subprocess
 >>> devnull = open('/dev/null', 'rb')
 >>> p = subprocess.Popen(['qemu-system-x86_64', '-broken'],
stdin=devnull, stdout=devnull, stderr=devnull, shell=False)
 >>> print p.returncode
 None
 >>> print p.poll()
 1
 >>> print p.returncode
 1


>
>> The Popen.returncode will only be set after by a poll(), wait() or
>> communicate(). If the Popen fails to launch a VM, the Popen.returncode
>> will not turn to None by itself.
>
> Hmm.  What is the value of .returncode then?

returncode starts with None and becomes the process exit code when the
process is over and one of that three methods is called (poll(),
wait() or communicate()).

There's an error in my description though. The correct would be: "The
Popen.returncode will only be set after a call to poll(), wait() or
communicate(). If the Popen fails to launch a VM, the Popen.returncode
will not turn from None to the actual return code by itself."


>
>> Instead of using Popen.returncode, let's use Popen.poll(), which
>> actually checks if child process has terminated.
>>
>> Signed-off-by: Amador Pahim <apahim@redhat.com>
>> Reviewed-by: Eduardo Habkost <ehabkost@redhat.com>
>> Reviewed-by: Fam Zheng <famz@redhat.com>
>> ---
>>  scripts/qemu.py | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/scripts/qemu.py b/scripts/qemu.py
>> index 880e3e8219..f0fade32bd 100644
>> --- a/scripts/qemu.py
>> +++ b/scripts/qemu.py
>> @@ -86,7 +86,7 @@ class QEMUMachine(object):
>>              raise
>>
>>      def is_running(self):
>> -        return self._popen and (self._popen.returncode is None)
>> +        return self._popen and (self._popen.poll() is None)
>>
>>      def exitcode(self):
>>          if self._popen is None:
>                return None
>            return self._popen.returncode
>
> Why is this one safe?

Here it's used just to retrieve the value from the Popen.returncode.
It's not being used to check whether the process is running or not.
Markus Armbruster July 20, 2017, 3:09 p.m. UTC | #3
Amador Pahim <apahim@redhat.com> writes:

> On Thu, Jul 20, 2017 at 1:49 PM, Markus Armbruster <armbru@redhat.com> wrote:
>> Amador Pahim <apahim@redhat.com> writes:
>>
>>> Current implementation is broken. It does not really test if the child
>>> process is running.
>>
>> What usage exactly is broken by this?  Got a reproducer for me?
>
> Problem is that 'returncode' is not set without a calling
> poll()/wait()/communicate(), so it's only useful to test if the
> process is running after such calls. But if we use 'poll()' instead,
> it will, according to the docs, "Check if child process has
> terminated. Set and return returncode attribute."
>
> Reproducer is:
>
>  >>> import subprocess
>  >>> devnull = open('/dev/null', 'rb')
>  >>> p = subprocess.Popen(['qemu-system-x86_64', '-broken'],
> stdin=devnull, stdout=devnull, stderr=devnull, shell=False)
>  >>> print p.returncode
>  None
>  >>> print p.poll()
>  1
>  >>> print p.returncode
>  1
>
>>> The Popen.returncode will only be set after by a poll(), wait() or
>>> communicate(). If the Popen fails to launch a VM, the Popen.returncode
>>> will not turn to None by itself.
>>
>> Hmm.  What is the value of .returncode then?
>
> returncode starts with None and becomes the process exit code when the
> process is over and one of that three methods is called (poll(),
> wait() or communicate()).
>
> There's an error in my description though. The correct would be: "The
> Popen.returncode will only be set after a call to poll(), wait() or
> communicate(). If the Popen fails to launch a VM, the Popen.returncode
> will not turn from None to the actual return code by itself."

Suggest to add ", and is_running() continues to report True".

>>> Instead of using Popen.returncode, let's use Popen.poll(), which
>>> actually checks if child process has terminated.
>>>
>>> Signed-off-by: Amador Pahim <apahim@redhat.com>
>>> Reviewed-by: Eduardo Habkost <ehabkost@redhat.com>
>>> Reviewed-by: Fam Zheng <famz@redhat.com>
>>> ---
>>>  scripts/qemu.py | 2 +-
>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/scripts/qemu.py b/scripts/qemu.py
>>> index 880e3e8219..f0fade32bd 100644
>>> --- a/scripts/qemu.py
>>> +++ b/scripts/qemu.py
>>> @@ -86,7 +86,7 @@ class QEMUMachine(object):
>>>              raise
>>>
>>>      def is_running(self):
>>> -        return self._popen and (self._popen.returncode is None)
>>> +        return self._popen and (self._popen.poll() is None)
>>>
>>>      def exitcode(self):
>>>          if self._popen is None:
>>                return None
>>            return self._popen.returncode
>>
>> Why is this one safe?
>
> Here it's used just to retrieve the value from the Popen.returncode.
> It's not being used to check whether the process is running or not.

If self._popen is not None, we return self._popen.returncode.  It's None
if .poll() etc. haven't been called.  Can this happen?  If not, why not?
If yes, why is returning None then okay?
Amador Pahim July 20, 2017, 3:46 p.m. UTC | #4
On Thu, Jul 20, 2017 at 5:09 PM, Markus Armbruster <armbru@redhat.com> wrote:
> Amador Pahim <apahim@redhat.com> writes:
>
>> On Thu, Jul 20, 2017 at 1:49 PM, Markus Armbruster <armbru@redhat.com> wrote:
>>> Amador Pahim <apahim@redhat.com> writes:
>>>
>>>> Current implementation is broken. It does not really test if the child
>>>> process is running.
>>>
>>> What usage exactly is broken by this?  Got a reproducer for me?
>>
>> Problem is that 'returncode' is not set without a calling
>> poll()/wait()/communicate(), so it's only useful to test if the
>> process is running after such calls. But if we use 'poll()' instead,
>> it will, according to the docs, "Check if child process has
>> terminated. Set and return returncode attribute."
>>
>> Reproducer is:
>>
>>  >>> import subprocess
>>  >>> devnull = open('/dev/null', 'rb')
>>  >>> p = subprocess.Popen(['qemu-system-x86_64', '-broken'],
>> stdin=devnull, stdout=devnull, stderr=devnull, shell=False)
>>  >>> print p.returncode
>>  None
>>  >>> print p.poll()
>>  1
>>  >>> print p.returncode
>>  1
>>
>>>> The Popen.returncode will only be set after by a poll(), wait() or
>>>> communicate(). If the Popen fails to launch a VM, the Popen.returncode
>>>> will not turn to None by itself.
>>>
>>> Hmm.  What is the value of .returncode then?
>>
>> returncode starts with None and becomes the process exit code when the
>> process is over and one of that three methods is called (poll(),
>> wait() or communicate()).
>>
>> There's an error in my description though. The correct would be: "The
>> Popen.returncode will only be set after a call to poll(), wait() or
>> communicate(). If the Popen fails to launch a VM, the Popen.returncode
>> will not turn from None to the actual return code by itself."
>
> Suggest to add ", and is_running() continues to report True".
>
>>>> Instead of using Popen.returncode, let's use Popen.poll(), which
>>>> actually checks if child process has terminated.
>>>>
>>>> Signed-off-by: Amador Pahim <apahim@redhat.com>
>>>> Reviewed-by: Eduardo Habkost <ehabkost@redhat.com>
>>>> Reviewed-by: Fam Zheng <famz@redhat.com>
>>>> ---
>>>>  scripts/qemu.py | 2 +-
>>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>>
>>>> diff --git a/scripts/qemu.py b/scripts/qemu.py
>>>> index 880e3e8219..f0fade32bd 100644
>>>> --- a/scripts/qemu.py
>>>> +++ b/scripts/qemu.py
>>>> @@ -86,7 +86,7 @@ class QEMUMachine(object):
>>>>              raise
>>>>
>>>>      def is_running(self):
>>>> -        return self._popen and (self._popen.returncode is None)
>>>> +        return self._popen and (self._popen.poll() is None)
>>>>
>>>>      def exitcode(self):
>>>>          if self._popen is None:
>>>                return None
>>>            return self._popen.returncode
>>>
>>> Why is this one safe?
>>
>> Here it's used just to retrieve the value from the Popen.returncode.
>> It's not being used to check whether the process is running or not.
>
> If self._popen is not None, we return self._popen.returncode.  It's None
> if .poll() etc. haven't been called.  Can this happen?  If not, why not?
> If yes, why is returning None then okay?

Yes, that can happen. This method is not returning an up-to-date
returncode, it's serving just as a wrapper to the attribute, being the
attribute updated or not.
I lack the background on why it was coded that way, but considering
the API-user perspective, I agree with you. We should return
self._popen.poll() here indeed. Fixing that.
Eduardo Habkost July 20, 2017, 5:49 p.m. UTC | #5
On Thu, Jul 20, 2017 at 05:09:11PM +0200, Markus Armbruster wrote:
> Amador Pahim <apahim@redhat.com> writes:
> 
> > On Thu, Jul 20, 2017 at 1:49 PM, Markus Armbruster <armbru@redhat.com> wrote:
> >> Amador Pahim <apahim@redhat.com> writes:
> >>
> >>> Current implementation is broken. It does not really test if the child
> >>> process is running.
> >>
> >> What usage exactly is broken by this?  Got a reproducer for me?
> >
> > Problem is that 'returncode' is not set without a calling
> > poll()/wait()/communicate(), so it's only useful to test if the
> > process is running after such calls. But if we use 'poll()' instead,
> > it will, according to the docs, "Check if child process has
> > terminated. Set and return returncode attribute."
> >
> > Reproducer is:
> >
> >  >>> import subprocess
> >  >>> devnull = open('/dev/null', 'rb')
> >  >>> p = subprocess.Popen(['qemu-system-x86_64', '-broken'],
> > stdin=devnull, stdout=devnull, stderr=devnull, shell=False)
> >  >>> print p.returncode
> >  None
> >  >>> print p.poll()
> >  1
> >  >>> print p.returncode
> >  1
> >
> >>> The Popen.returncode will only be set after by a poll(), wait() or
> >>> communicate(). If the Popen fails to launch a VM, the Popen.returncode
> >>> will not turn to None by itself.
> >>
> >> Hmm.  What is the value of .returncode then?
> >
> > returncode starts with None and becomes the process exit code when the
> > process is over and one of that three methods is called (poll(),
> > wait() or communicate()).
> >
> > There's an error in my description though. The correct would be: "The
> > Popen.returncode will only be set after a call to poll(), wait() or
> > communicate(). If the Popen fails to launch a VM, the Popen.returncode
> > will not turn from None to the actual return code by itself."
> 
> Suggest to add ", and is_running() continues to report True".
> 
> >>> Instead of using Popen.returncode, let's use Popen.poll(), which
> >>> actually checks if child process has terminated.
> >>>
> >>> Signed-off-by: Amador Pahim <apahim@redhat.com>
> >>> Reviewed-by: Eduardo Habkost <ehabkost@redhat.com>
> >>> Reviewed-by: Fam Zheng <famz@redhat.com>
> >>> ---
> >>>  scripts/qemu.py | 2 +-
> >>>  1 file changed, 1 insertion(+), 1 deletion(-)
> >>>
> >>> diff --git a/scripts/qemu.py b/scripts/qemu.py
> >>> index 880e3e8219..f0fade32bd 100644
> >>> --- a/scripts/qemu.py
> >>> +++ b/scripts/qemu.py
> >>> @@ -86,7 +86,7 @@ class QEMUMachine(object):
> >>>              raise
> >>>
> >>>      def is_running(self):
> >>> -        return self._popen and (self._popen.returncode is None)
> >>> +        return self._popen and (self._popen.poll() is None)
> >>>

After re-reading shutdown(), I think this is _not_ OK: if
is_running() return False before we call .wait(), we will never
load the log file or run _post_shutdown() if QEMU exits between
the launch() and shutdown() calls.

Yes, it's fragile.

The root problem on both launch() and shutdown() seems to be
coupling the external "is QEMU running?" state with the internal
"did we load the log file and ran _post_shutdown() already?"
state.

I see two possible approaches for this:

1) Benefit from the fact that the internal Popen state will not
   change under our feet unless we explicitly call
   poll()/wait()/etc, and keep the existing code.  (Not my
   favorite option)

2) Rewrite the code so that we don't depend on the subtle Popen
   internal state rules, and track our own internal state in
   a QEMUMachine attribute.  e.g.:

    def _handle_shutdown(self):
        '''Load log file and call _post_shutdown() hook if necessary'''
        # Must be called only after QEMU actually exited.
        assert not self.is_running()
        if self._shutdown_pending:
            if self.exitcode() < 0:
                sys.stderr.write('qemu received signal %i: %s\n' % (-exitcode, ' '.join(self._args)))
            self._load_io_log()
            self._post_shutdown()
            self._shutdown_pending = False

    def _terminate(self):
        '''Terminate QEMU if it's still running'''
        if self.is_running():
            try:
                self._qmp.cmd('quit')
                self._qmp.close()
            except:
                self._popen.kill()
                self._popen.wait()

    def _launch(self):
        '''Launch the VM and establish a QMP connection'''
        devnull = open('/dev/null', 'rb')
        qemulog = open(self._qemu_log_path, 'wb')
        self._shutdown_pending = True
        self._pre_launch()
        args = self._wrapper + [self._binary] + self._base_args() + self._args
        self._popen = subprocess.Popen(args, stdin=devnull, stdout=qemulog,
                                       stderr=subprocess.STDOUT, shell=False)
        self._post_launch()

    def launch(self):
        try:
            self._launch()
        except:
            self._terminate()
            self._handle_shutdown()
            raise

    def shutdown(self):
        '''Terminate the VM and clean up'''
        self._terminate()
        self._handle_shutdown()


Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>

> >>>      def exitcode(self):
> >>>          if self._popen is None:
> >>                return None
> >>            return self._popen.returncode
> >>
> >> Why is this one safe?
> >
> > Here it's used just to retrieve the value from the Popen.returncode.
> > It's not being used to check whether the process is running or not.
> 
> If self._popen is not None, we return self._popen.returncode.  It's None
> if .poll() etc. haven't been called.  Can this happen?  If not, why not?
> If yes, why is returning None then okay?

It can't happen because the only caller of exitcode()
(device-crash-test) calls it immediately after shutdown().  But
it would be nice to make exitcode() behavior consistent with
is_running().
Amador Pahim July 20, 2017, 8:14 p.m. UTC | #6
On Thu, Jul 20, 2017 at 7:49 PM, Eduardo Habkost <ehabkost@redhat.com> wrote:
> On Thu, Jul 20, 2017 at 05:09:11PM +0200, Markus Armbruster wrote:
>> Amador Pahim <apahim@redhat.com> writes:
>>
>> > On Thu, Jul 20, 2017 at 1:49 PM, Markus Armbruster <armbru@redhat.com> wrote:
>> >> Amador Pahim <apahim@redhat.com> writes:
>> >>
>> >>> Current implementation is broken. It does not really test if the child
>> >>> process is running.
>> >>
>> >> What usage exactly is broken by this?  Got a reproducer for me?
>> >
>> > Problem is that 'returncode' is not set without a calling
>> > poll()/wait()/communicate(), so it's only useful to test if the
>> > process is running after such calls. But if we use 'poll()' instead,
>> > it will, according to the docs, "Check if child process has
>> > terminated. Set and return returncode attribute."
>> >
>> > Reproducer is:
>> >
>> >  >>> import subprocess
>> >  >>> devnull = open('/dev/null', 'rb')
>> >  >>> p = subprocess.Popen(['qemu-system-x86_64', '-broken'],
>> > stdin=devnull, stdout=devnull, stderr=devnull, shell=False)
>> >  >>> print p.returncode
>> >  None
>> >  >>> print p.poll()
>> >  1
>> >  >>> print p.returncode
>> >  1
>> >
>> >>> The Popen.returncode will only be set after by a poll(), wait() or
>> >>> communicate(). If the Popen fails to launch a VM, the Popen.returncode
>> >>> will not turn to None by itself.
>> >>
>> >> Hmm.  What is the value of .returncode then?
>> >
>> > returncode starts with None and becomes the process exit code when the
>> > process is over and one of that three methods is called (poll(),
>> > wait() or communicate()).
>> >
>> > There's an error in my description though. The correct would be: "The
>> > Popen.returncode will only be set after a call to poll(), wait() or
>> > communicate(). If the Popen fails to launch a VM, the Popen.returncode
>> > will not turn from None to the actual return code by itself."
>>
>> Suggest to add ", and is_running() continues to report True".
>>
>> >>> Instead of using Popen.returncode, let's use Popen.poll(), which
>> >>> actually checks if child process has terminated.
>> >>>
>> >>> Signed-off-by: Amador Pahim <apahim@redhat.com>
>> >>> Reviewed-by: Eduardo Habkost <ehabkost@redhat.com>
>> >>> Reviewed-by: Fam Zheng <famz@redhat.com>
>> >>> ---
>> >>>  scripts/qemu.py | 2 +-
>> >>>  1 file changed, 1 insertion(+), 1 deletion(-)
>> >>>
>> >>> diff --git a/scripts/qemu.py b/scripts/qemu.py
>> >>> index 880e3e8219..f0fade32bd 100644
>> >>> --- a/scripts/qemu.py
>> >>> +++ b/scripts/qemu.py
>> >>> @@ -86,7 +86,7 @@ class QEMUMachine(object):
>> >>>              raise
>> >>>
>> >>>      def is_running(self):
>> >>> -        return self._popen and (self._popen.returncode is None)
>> >>> +        return self._popen and (self._popen.poll() is None)
>> >>>
>
> After re-reading shutdown(), I think this is _not_ OK: if
> is_running() return False before we call .wait(), we will never
> load the log file or run _post_shutdown() if QEMU exits between
> the launch() and shutdown() calls.

Yes, I just noticed that while cleaning up the code.

>
> Yes, it's fragile.
>
> The root problem on both launch() and shutdown() seems to be
> coupling the external "is QEMU running?" state with the internal
> "did we load the log file and ran _post_shutdown() already?"
> state.
>
> I see two possible approaches for this:
>
> 1) Benefit from the fact that the internal Popen state will not
>    change under our feet unless we explicitly call
>    poll()/wait()/etc, and keep the existing code.  (Not my
>    favorite option)
>
> 2) Rewrite the code so that we don't depend on the subtle Popen
>    internal state rules, and track our own internal state in
>    a QEMUMachine attribute.  e.g.:

+1 for this approach. I'm working on something similar, thanks for the
detailed "e.g." code here.

>
>     def _handle_shutdown(self):
>         '''Load log file and call _post_shutdown() hook if necessary'''
>         # Must be called only after QEMU actually exited.
>         assert not self.is_running()
>         if self._shutdown_pending:
>             if self.exitcode() < 0:
>                 sys.stderr.write('qemu received signal %i: %s\n' % (-exitcode, ' '.join(self._args)))
>             self._load_io_log()
>             self._post_shutdown()
>             self._shutdown_pending = False
>
>     def _terminate(self):
>         '''Terminate QEMU if it's still running'''
>         if self.is_running():
>             try:
>                 self._qmp.cmd('quit')
>                 self._qmp.close()
>             except:
>                 self._popen.kill()
>                 self._popen.wait()
>
>     def _launch(self):
>         '''Launch the VM and establish a QMP connection'''
>         devnull = open('/dev/null', 'rb')
>         qemulog = open(self._qemu_log_path, 'wb')
>         self._shutdown_pending = True
>         self._pre_launch()
>         args = self._wrapper + [self._binary] + self._base_args() + self._args
>         self._popen = subprocess.Popen(args, stdin=devnull, stdout=qemulog,
>                                        stderr=subprocess.STDOUT, shell=False)
>         self._post_launch()
>
>     def launch(self):
>         try:
>             self._launch()
>         except:
>             self._terminate()
>             self._handle_shutdown()
>             raise
>
>     def shutdown(self):
>         '''Terminate the VM and clean up'''
>         self._terminate()
>         self._handle_shutdown()
>
>
> Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
>
>> >>>      def exitcode(self):
>> >>>          if self._popen is None:
>> >>                return None
>> >>            return self._popen.returncode
>> >>
>> >> Why is this one safe?
>> >
>> > Here it's used just to retrieve the value from the Popen.returncode.
>> > It's not being used to check whether the process is running or not.
>>
>> If self._popen is not None, we return self._popen.returncode.  It's None
>> if .poll() etc. haven't been called.  Can this happen?  If not, why not?
>> If yes, why is returning None then okay?
>
> It can't happen because the only caller of exitcode()
> (device-crash-test) calls it immediately after shutdown().  But
> it would be nice to make exitcode() behavior consistent with
> is_running().
>
> --
> Eduardo
Lukáš Doktor July 21, 2017, 7:34 a.m. UTC | #7
Dne 20.7.2017 v 22:14 Amador Pahim napsal(a):
> On Thu, Jul 20, 2017 at 7:49 PM, Eduardo Habkost <ehabkost@redhat.com> wrote:
>> On Thu, Jul 20, 2017 at 05:09:11PM +0200, Markus Armbruster wrote:
>>> Amador Pahim <apahim@redhat.com> writes:
>>>
>>>> On Thu, Jul 20, 2017 at 1:49 PM, Markus Armbruster <armbru@redhat.com> wrote:
>>>>> Amador Pahim <apahim@redhat.com> writes:
>>>>>
>>>>>> Current implementation is broken. It does not really test if the child
>>>>>> process is running.
>>>>>
>>>>> What usage exactly is broken by this?  Got a reproducer for me?
>>>>
>>>> Problem is that 'returncode' is not set without a calling
>>>> poll()/wait()/communicate(), so it's only useful to test if the
>>>> process is running after such calls. But if we use 'poll()' instead,
>>>> it will, according to the docs, "Check if child process has
>>>> terminated. Set and return returncode attribute."
>>>>
>>>> Reproducer is:
>>>>
>>>>  >>> import subprocess
>>>>  >>> devnull = open('/dev/null', 'rb')
>>>>  >>> p = subprocess.Popen(['qemu-system-x86_64', '-broken'],
>>>> stdin=devnull, stdout=devnull, stderr=devnull, shell=False)
>>>>  >>> print p.returncode
>>>>  None
>>>>  >>> print p.poll()
>>>>  1
>>>>  >>> print p.returncode
>>>>  1
>>>>
>>>>>> The Popen.returncode will only be set after by a poll(), wait() or
>>>>>> communicate(). If the Popen fails to launch a VM, the Popen.returncode
>>>>>> will not turn to None by itself.
>>>>>
>>>>> Hmm.  What is the value of .returncode then?
>>>>
>>>> returncode starts with None and becomes the process exit code when the
>>>> process is over and one of that three methods is called (poll(),
>>>> wait() or communicate()).
>>>>
>>>> There's an error in my description though. The correct would be: "The
>>>> Popen.returncode will only be set after a call to poll(), wait() or
>>>> communicate(). If the Popen fails to launch a VM, the Popen.returncode
>>>> will not turn from None to the actual return code by itself."
>>>
>>> Suggest to add ", and is_running() continues to report True".
>>>
>>>>>> Instead of using Popen.returncode, let's use Popen.poll(), which
>>>>>> actually checks if child process has terminated.
>>>>>>
>>>>>> Signed-off-by: Amador Pahim <apahim@redhat.com>
>>>>>> Reviewed-by: Eduardo Habkost <ehabkost@redhat.com>
>>>>>> Reviewed-by: Fam Zheng <famz@redhat.com>
>>>>>> ---
>>>>>>  scripts/qemu.py | 2 +-
>>>>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/scripts/qemu.py b/scripts/qemu.py
>>>>>> index 880e3e8219..f0fade32bd 100644
>>>>>> --- a/scripts/qemu.py
>>>>>> +++ b/scripts/qemu.py
>>>>>> @@ -86,7 +86,7 @@ class QEMUMachine(object):
>>>>>>              raise
>>>>>>
>>>>>>      def is_running(self):
>>>>>> -        return self._popen and (self._popen.returncode is None)
>>>>>> +        return self._popen and (self._popen.poll() is None)
>>>>>>
>>
>> After re-reading shutdown(), I think this is _not_ OK: if
>> is_running() return False before we call .wait(), we will never
>> load the log file or run _post_shutdown() if QEMU exits between
>> the launch() and shutdown() calls.
> 
> Yes, I just noticed that while cleaning up the code.
> 
>>
>> Yes, it's fragile.
>>
>> The root problem on both launch() and shutdown() seems to be
>> coupling the external "is QEMU running?" state with the internal
>> "did we load the log file and ran _post_shutdown() already?"
>> state.
>>
>> I see two possible approaches for this:
>>
>> 1) Benefit from the fact that the internal Popen state will not
>>    change under our feet unless we explicitly call
>>    poll()/wait()/etc, and keep the existing code.  (Not my
>>    favorite option)
>>
>> 2) Rewrite the code so that we don't depend on the subtle Popen
>>    internal state rules, and track our own internal state in
>>    a QEMUMachine attribute.  e.g.:
> 
> +1 for this approach. I'm working on something similar, thanks for the
> detailed "e.g." code here.
> 
>>
>>     def _handle_shutdown(self):
>>         '''Load log file and call _post_shutdown() hook if necessary'''
>>         # Must be called only after QEMU actually exited.
>>         assert not self.is_running()
>>         if self._shutdown_pending:
>>             if self.exitcode() < 0:
>>                 sys.stderr.write('qemu received signal %i: %s\n' % (-exitcode, ' '.join(self._args)))
>>             self._load_io_log()
>>             self._post_shutdown()
>>             self._shutdown_pending = False
>>
>>     def _terminate(self):
>>         '''Terminate QEMU if it's still running'''
>>         if self.is_running():
>>             try:
>>                 self._qmp.cmd('quit')
>>                 self._qmp.close()
>>             except:
>>                 self._popen.kill()
>>                 self._popen.wait()
>>
>>     def _launch(self):
>>         '''Launch the VM and establish a QMP connection'''
>>         devnull = open('/dev/null', 'rb')
>>         qemulog = open(self._qemu_log_path, 'wb')
>>         self._shutdown_pending = True
>>         self._pre_launch()
>>         args = self._wrapper + [self._binary] + self._base_args() + self._args
>>         self._popen = subprocess.Popen(args, stdin=devnull, stdout=qemulog,
>>                                        stderr=subprocess.STDOUT, shell=False)
>>         self._post_launch()
>>
>>     def launch(self):
>>         try:
>>             self._launch()
>>         except:
>>             self._terminate()
>>             self._handle_shutdown()
>>             raise
>>
>>     def shutdown(self):
>>         '''Terminate the VM and clean up'''
>>         self._terminate()
>>         self._handle_shutdown()
>>
This part also caught my attention and I also meant to improve it when this series is merged. Anyway let's state my suggestions here, take it or let it go:

1. `get_log` should check whether `self._iolog` is `None` and then it should check for process status
2. the `self._iolog` is `None` is the indication whether `shutdown` was called or not (not whether the process exists or not)
3. add `__del__` to cleanup in case one forgets to call `shutdown` (currently the files and processes are left behind)
4. use `name = "qemu-%d-%d" % (os.getpid(), id(self))` to allow multiple instances with default name at the same time.


Also I just realized that even with just this patch (as is) files/processes can be left behind:

    >>> import qemu, os
    >>> a=qemu.QEMUMachine("/usr/bin/qemu-kvm", debug=True)
    >>> a.launch()
    QMP:>>> {'execute': 'qmp_capabilities'}
    QMP:<<< {u'return': {}}
    >>> a.is_running()
    False
    >>> a.shutdown()
    >>> os.path.exists(a._qemu_log_path)
    True

Before this patch it worked well as (as Eduardo mentioned) the `is_running` was tracing internal state, not the process state.

Regards,
Lukáš

>>
>> Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
>>
>>>>>>      def exitcode(self):
>>>>>>          if self._popen is None:
>>>>>                return None
>>>>>            return self._popen.returncode
>>>>>
>>>>> Why is this one safe?
>>>>
>>>> Here it's used just to retrieve the value from the Popen.returncode.
>>>> It's not being used to check whether the process is running or not.
>>>
>>> If self._popen is not None, we return self._popen.returncode.  It's None
>>> if .poll() etc. haven't been called.  Can this happen?  If not, why not?
>>> If yes, why is returning None then okay?
>>
>> It can't happen because the only caller of exitcode()
>> (device-crash-test) calls it immediately after shutdown().  But
>> it would be nice to make exitcode() behavior consistent with
>> is_running().
>>
>> --
>> Eduardo
diff mbox

Patch

diff --git a/scripts/qemu.py b/scripts/qemu.py
index 880e3e8219..f0fade32bd 100644
--- a/scripts/qemu.py
+++ b/scripts/qemu.py
@@ -86,7 +86,7 @@  class QEMUMachine(object):
             raise
 
     def is_running(self):
-        return self._popen and (self._popen.returncode is None)
+        return self._popen and (self._popen.poll() is None)
 
     def exitcode(self):
         if self._popen is None: