diff mbox series

[ovs-dev,v1] ovs-tcpdump: Cleanup mirror failed with twice fatal signals

Message ID 20240221074214.96824-1-zhihui.ding@easystack.cn
State Changes Requested
Headers show
Series [ovs-dev,v1] ovs-tcpdump: Cleanup mirror failed with twice fatal signals | expand

Checks

Context Check Description
ovsrobot/apply-robot warning apply and check: warning
ovsrobot/github-robot-_Build_and_Test success github build: passed
ovsrobot/intel-ovs-compilation success test: success

Commit Message

Daniel Ding Feb. 21, 2024, 7:42 a.m. UTC
After running ovs-tcpdump and inputs multiple CTRL+C, the program will
raise the following exception.

Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/usr/bin/ovs-tcpdump", line 421, in cleanup_mirror
    ovsdb = OVSDB(db_sock)
  File "/usr/bin/ovs-tcpdump", line 168, in __init__
    OVSDB.wait_for_db_change(self._idl_conn)  # Initial Sync with DB
  File "/usr/bin/ovs-tcpdump", line 155, in wait_for_db_change
    while idl.change_seqno == seq and not idl.run():

Signed-off-by: Daniel Ding <zhihui.ding@easystack.cn>
---
 utilities/ovs-tcpdump.in | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

Comments

Aaron Conole Feb. 21, 2024, 2:27 p.m. UTC | #1
Daniel Ding <danieldin186@gmail.com> writes:

> After running ovs-tcpdump and inputs multiple CTRL+C, the program will
> raise the following exception.
>
> Error in atexit._run_exitfuncs:
> Traceback (most recent call last):
>   File "/usr/bin/ovs-tcpdump", line 421, in cleanup_mirror
>     ovsdb = OVSDB(db_sock)
>   File "/usr/bin/ovs-tcpdump", line 168, in __init__
>     OVSDB.wait_for_db_change(self._idl_conn)  # Initial Sync with DB
>   File "/usr/bin/ovs-tcpdump", line 155, in wait_for_db_change
>     while idl.change_seqno == seq and not idl.run():
>
> Signed-off-by: Daniel Ding <zhihui.ding@easystack.cn>
> ---

LGTM for the linux side - maybe Alin might check the windows side.

When you post v2 you can keep my

Reviewed-by: Aaron Conole <aconole@redhat.com>

>  utilities/ovs-tcpdump.in | 15 +++++++++++++++
>  1 file changed, 15 insertions(+)
>
> diff --git a/utilities/ovs-tcpdump.in b/utilities/ovs-tcpdump.in
> index 4cbd9a5d3..eec2262d6 100755
> --- a/utilities/ovs-tcpdump.in
> +++ b/utilities/ovs-tcpdump.in
> @@ -17,6 +17,7 @@
>  import os
>  import pwd
>  from random import randint
> +import signal
>  import subprocess
>  import sys
>  import time
> @@ -417,8 +418,22 @@ def py_which(executable):
>                 for path in os.environ["PATH"].split(os.pathsep))
>  
>  
> +def ignore_fatal_signals():
> +    if sys.platform == "win32":
> +        signals = [signal.SIGTERM, signal.SIGINT]
> +    else:
> +        signals = [signal.SIGTERM, signal.SIGINT, signal.SIGHUP,
> +                   signal.SIGALRM]
> +
> +    for signr in signals:
> +        signal.signal(signr, signal.SIG_IGN)
> +
> +

Just a nit, but it might be clearer to put the ignore_fatal_signals as a
function scoped in teardown below.  That way it is a bit clearer that
this will only be called to turn off these in the double failure case.

>  def teardown(db_sock, interface, mirror_interface, tap_created):
>      def cleanup_mirror():
> +        # Ignore fatal signals, avoid it to break OVSDB operations.
> +        # So that cleanup mirror and ports be finished.
> +        ignore_fatal_signals()
>          try:
>              ovsdb = OVSDB(db_sock)
>              ovsdb.destroy_mirror(interface, ovsdb.port_bridge(interface))
Ilya Maximets Feb. 21, 2024, 5:55 p.m. UTC | #2
On 2/21/24 15:27, Aaron Conole wrote:
> Daniel Ding <danieldin186@gmail.com> writes:
> 
>> After running ovs-tcpdump and inputs multiple CTRL+C, the program will
>> raise the following exception.
>>
>> Error in atexit._run_exitfuncs:
>> Traceback (most recent call last):
>>   File "/usr/bin/ovs-tcpdump", line 421, in cleanup_mirror
>>     ovsdb = OVSDB(db_sock)
>>   File "/usr/bin/ovs-tcpdump", line 168, in __init__
>>     OVSDB.wait_for_db_change(self._idl_conn)  # Initial Sync with DB
>>   File "/usr/bin/ovs-tcpdump", line 155, in wait_for_db_change
>>     while idl.change_seqno == seq and not idl.run():
>>
>> Signed-off-by: Daniel Ding <zhihui.ding@easystack.cn>
>> ---
> 
> LGTM for the linux side - maybe Alin might check the windows side.

The code is copied from the fatal-signla library, so it is likely fine.
However, I don;t think this issue should be fixed on the application
level (ovs-tcpdump).   There seems to be adifference in how C and python
fatal-signla libraries behave.  The C version calls the hooks in the run()
function and the signal handler only updates the signal number variable.
So, if the same signal arrives multiple times it will be handled only once
and the run() function will not exit until all the hooks are done, unless
it is a segfault.

With python implementation we're calling hooks right from the signal
handler (it's not a real signal handler, so we're allowed to use not
really singal-safe functions there).  So, if another signal arrives while
we're executing the hooks, the second hook execution will be skipped
due to 'recursion' check, but the signal will be re-raised immediately,
killing the process.

There is likley a way to fix that by using a mutex instead of recursion
check and blocking it while executing the hooks.  The signal number
will need to be stored in the signal handler and the signal will need
to be re-raised in the call_hooks() instead of signal handler.
We can use mutex.acquire(blocking=False) as a way to prevent recursion.

Does that make sense?

Best regards, Ilya Maximets.
Daniel Ding Feb. 22, 2024, 10:02 a.m. UTC | #3
> 2024年2月22日 上午1:55,Ilya Maximets <i.maximets@ovn.org> 写道:
> 
> On 2/21/24 15:27, Aaron Conole wrote:
>> Daniel Ding <danieldin186@gmail.com> writes:
>> 
>>> After running ovs-tcpdump and inputs multiple CTRL+C, the program will
>>> raise the following exception.
>>> 
>>> Error in atexit._run_exitfuncs:
>>> Traceback (most recent call last):
>>>  File "/usr/bin/ovs-tcpdump", line 421, in cleanup_mirror
>>>    ovsdb = OVSDB(db_sock)
>>>  File "/usr/bin/ovs-tcpdump", line 168, in __init__
>>>    OVSDB.wait_for_db_change(self._idl_conn)  # Initial Sync with DB
>>>  File "/usr/bin/ovs-tcpdump", line 155, in wait_for_db_change
>>>    while idl.change_seqno == seq and not idl.run():
>>> 
>>> Signed-off-by: Daniel Ding <zhihui.ding@easystack.cn>
>>> ---
>> 
>> LGTM for the linux side - maybe Alin might check the windows side.
> 
> The code is copied from the fatal-signla library, so it is likely fine.
> However, I don;t think this issue should be fixed on the application
> level (ovs-tcpdump).   There seems to be adifference in how C and python
> fatal-signla libraries behave.  The C version calls the hooks in the run()
> function and the signal handler only updates the signal number variable.
> So, if the same signal arrives multiple times it will be handled only once
> and the run() function will not exit until all the hooks are done, unless
> it is a segfault.
> 
> With python implementation we're calling hooks right from the signal
> handler (it's not a real signal handler, so we're allowed to use not
> really singal-safe functions there).  So, if another signal arrives while
> we're executing the hooks, the second hook execution will be skipped
> due to 'recursion' check, but the signal will be re-raised immediately,
> killing the process.

As your suggestions, I rewrite recurse as a threading.Lock. So that protect calling hooks
applications registered.

> 
> There is likley a way to fix that by using a mutex instead of recursion
> check and blocking it while executing the hooks.  The signal number
> will need to be stored in the signal handler and the signal will need
> to be re-raised in the call_hooks() instead of signal handler.
> We can use mutex.acquire(blocking=False) as a way to prevent recursion.
> 

To avoid breaking calling hooks, I set the signal restored in the signal handler to SIG_IGN in 
call_hooks. And move re-raised the signal in the call_hooks with locked.

> Does that make sense?
> 
> Best regards, Ilya Maximets.

diff --git a/python/ovs/fatal_signal.py b/python/ovs/fatal_signal.py
index cb2e99e87..3aa56a166 100644
--- a/python/ovs/fatal_signal.py
+++ b/python/ovs/fatal_signal.py
@@ -16,6 +16,7 @@ import atexit
 import os
 import signal
 import sys
+import threading

 import ovs.vlog

@@ -112,28 +113,34 @@ def _unlink(file_):
 def _signal_handler(signr, _):
     _call_hooks(signr)

-    # Re-raise the signal with the default handling so that the program
-    # termination status reflects that we were killed by this signal.
-    signal.signal(signr, signal.SIG_DFL)
-    os.kill(os.getpid(), signr)
-

 def _atexit_handler():
     _call_hooks(0)


-recurse = False
+recurse = threading.Lock()


 def _call_hooks(signr):
     global recurse
-    if recurse:
+    if recurse.locked():
         return
-    recurse = True

-    for hook, cancel, run_at_exit in _hooks:
-        if signr != 0 or run_at_exit:
-            hook()
+    with recurse:
+        if signr != 0:
+            signal.signal(signr, signal.SIG_IGN)
+        else:
+            signal.signal(signal.SIGINT, signal.SIG_IGN)
+
+        for hook, cancel, run_at_exit in _hooks:
+            if signr != 0 or run_at_exit:
+                hook()
+
+        if signr != 0:
+           # Re-raise the signal with the default handling so that the program
+           # termination status reflects that we were killed by this signal.
+           signal.signal(signr, signal.SIG_DFL)
+           os.kill(os.getpid(), signr)


 _inited = False
@@ -165,7 +172,6 @@ def signal_alarm(timeout):

     if sys.platform == "win32":
         import time
-        import threading

         class Alarm (threading.Thread):
             def __init__(self, timeout):
--
2.43.0

Thanks Ilya Maximets, please help me review again.

Best regards, Daniel Ding.
Ilya Maximets Feb. 22, 2024, 10:53 a.m. UTC | #4
On 2/22/24 11:02, Daniel Ding wrote:
> 
> 
>> 2024年2月22日 上午1:55,Ilya Maximets <i.maximets@ovn.org <mailto:i.maximets@ovn.org>> 写道:
>>
>> On 2/21/24 15:27, Aaron Conole wrote:
>>> Daniel Ding <danieldin186@gmail.com <mailto:danieldin186@gmail.com>> writes:
>>>
>>>> After running ovs-tcpdump and inputs multiple CTRL+C, the program will
>>>> raise the following exception.
>>>>
>>>> Error in atexit._run_exitfuncs:
>>>> Traceback (most recent call last):
>>>>  File "/usr/bin/ovs-tcpdump", line 421, in cleanup_mirror
>>>>    ovsdb = OVSDB(db_sock)
>>>>  File "/usr/bin/ovs-tcpdump", line 168, in __init__
>>>>    OVSDB.wait_for_db_change(self._idl_conn)  # Initial Sync with DB
>>>>  File "/usr/bin/ovs-tcpdump", line 155, in wait_for_db_change
>>>>    while idl.change_seqno == seq and not idl.run():
>>>>
>>>> Signed-off-by: Daniel Ding <zhihui.ding@easystack.cn <mailto:zhihui.ding@easystack.cn>>
>>>> ---
>>>
>>> LGTM for the linux side - maybe Alin might check the windows side.
>>
>> The code is copied from the fatal-signla library, so it is likely fine.
>> However, I don;t think this issue should be fixed on the application
>> level (ovs-tcpdump).   There seems to be adifference in how C and python
>> fatal-signla libraries behave.  The C version calls the hooks in the run()
>> function and the signal handler only updates the signal number variable.
>> So, if the same signal arrives multiple times it will be handled only once
>> and the run() function will not exit until all the hooks are done, unless
>> it is a segfault.
>>
>> With python implementation we're calling hooks right from the signal
>> handler (it's not a real signal handler, so we're allowed to use not
>> really singal-safe functions there).  So, if another signal arrives while
>> we're executing the hooks, the second hook execution will be skipped
>> due to 'recursion' check, but the signal will be re-raised immediately,
>> killing the process.
> 
> As your suggestions, I rewrite recurse as a threading.Lock. So that protect calling hooks
> applications registered.
> 
>>
>> There is likley a way to fix that by using a mutex instead of recursion
>> check and blocking it while executing the hooks.  The signal number
>> will need to be stored in the signal handler and the signal will need
>> to be re-raised in the call_hooks() instead of signal handler.
>> We can use mutex.acquire(blocking=False) as a way to prevent recursion.
>>
> 
> To avoid breaking calling hooks, I set the signal restored in the signal handler to SIG_IGN in 
> call_hooks.

I'm not sure this is necessary, see below.

> And move re-raised the signal in the call_hooks with locked.
> 
>> Does that make sense?
>>
>> Best regards, Ilya Maximets.
> 
> diff --git a/python/ovs/fatal_signal.py b/python/ovs/fatal_signal.py
> index cb2e99e87..3aa56a166 100644
> --- a/python/ovs/fatal_signal.py
> +++ b/python/ovs/fatal_signal.py
> @@ -16,6 +16,7 @@ import atexit
>  import os
>  import signal
>  import sys
> +import threading
> 
>  import ovs.vlog
> 
> @@ -112,28 +113,34 @@ def _unlink(file_):
>  def _signal_handler(signr, _):
>      _call_hooks(signr)
> 
> -    # Re-raise the signal with the default handling so that the program
> -    # termination status reflects that we were killed by this signal.
> -    signal.signal(signr, signal.SIG_DFL)
> -    os.kill(os.getpid(), signr)
> -
> 
>  def _atexit_handler():
>      _call_hooks(0)
> 
> 
> -recurse = False
> +recurse = threading.Lock()

It's better to rename this to just 'mutex' or 'lock'.

> 
> 
>  def _call_hooks(signr):
>      global recurse
> -    if recurse:
> +    if recurse.locked():
>          return

There is an awkward race window between checking that the mutex is
locked and actually attempting to take it below.  It's probably not
a big deal here, but it's better to take the lock at the same time
with checking, e.g.:

    if not mutex.acquire(blocking=False):
        return

And since we're going to end the process anyway (it's a handler
for fatal signals after all), we don't actually need to unlock it,
AFAICT.

> -    recurse = True
> 
> -    for hook, cancel, run_at_exit in _hooks:
> -        if signr != 0 or run_at_exit:
> -            hook()
> +    with recurse:
> +        if signr != 0:
> +            signal.signal(signr, signal.SIG_IGN)
> +        else:
> +            signal.signal(signal.SIGINT, signal.SIG_IGN)

If another signal arrives while we're under the lock, the signal
handler will call the _call_hooks(), check that the lock is already
taken and return.  There is no need to ignore it.
Also, if a different signal arrives the code above will not ignore
it anyway.

> +
> +        for hook, cancel, run_at_exit in _hooks:
> +            if signr != 0 or run_at_exit:
> +                hook()
> +
> +        if signr != 0:
> +           # Re-raise the signal with the default handling so that the program
> +           # termination status reflects that we were killed by this signal.
> +           signal.signal(signr, signal.SIG_DFL)
> +           os.kill(os.getpid(), signr)

We may want to place this in the try: finally: block, so we still
re-raise the signal even if one of the hooks throws an exception.
Hooks should not generally throw exceptions, but may be safer this
way.

What do you think?

> 
> 
>  _inited = False
> @@ -165,7 +172,6 @@ def signal_alarm(timeout):
> 
>      if sys.platform == "win32":
>          import time
> -        import threading
> 
>          class Alarm (threading.Thread):
>              def __init__(self, timeout):
> --
> 2.43.0
> 
> Thanks Ilya Maximets, please help me review again.
> 
> Best regards, Daniel Ding.
>
Ilya Maximets Feb. 22, 2024, 11:20 a.m. UTC | #5
On 2/22/24 11:53, Ilya Maximets wrote:
> On 2/22/24 11:02, Daniel Ding wrote:
>>
>>
>>> 2024年2月22日 上午1:55,Ilya Maximets <i.maximets@ovn.org <mailto:i.maximets@ovn.org>> 写道:
>>>
>>> On 2/21/24 15:27, Aaron Conole wrote:
>>>> Daniel Ding <danieldin186@gmail.com <mailto:danieldin186@gmail.com>> writes:
>>>>
>>>>> After running ovs-tcpdump and inputs multiple CTRL+C, the program will
>>>>> raise the following exception.
>>>>>
>>>>> Error in atexit._run_exitfuncs:
>>>>> Traceback (most recent call last):
>>>>>  File "/usr/bin/ovs-tcpdump", line 421, in cleanup_mirror
>>>>>    ovsdb = OVSDB(db_sock)
>>>>>  File "/usr/bin/ovs-tcpdump", line 168, in __init__
>>>>>    OVSDB.wait_for_db_change(self._idl_conn)  # Initial Sync with DB
>>>>>  File "/usr/bin/ovs-tcpdump", line 155, in wait_for_db_change
>>>>>    while idl.change_seqno == seq and not idl.run():
>>>>>
>>>>> Signed-off-by: Daniel Ding <zhihui.ding@easystack.cn <mailto:zhihui.ding@easystack.cn>>
>>>>> ---
>>>>
>>>> LGTM for the linux side - maybe Alin might check the windows side.
>>>
>>> The code is copied from the fatal-signla library, so it is likely fine.
>>> However, I don;t think this issue should be fixed on the application
>>> level (ovs-tcpdump).   There seems to be adifference in how C and python
>>> fatal-signla libraries behave.  The C version calls the hooks in the run()
>>> function and the signal handler only updates the signal number variable.
>>> So, if the same signal arrives multiple times it will be handled only once
>>> and the run() function will not exit until all the hooks are done, unless
>>> it is a segfault.
>>>
>>> With python implementation we're calling hooks right from the signal
>>> handler (it's not a real signal handler, so we're allowed to use not
>>> really singal-safe functions there).  So, if another signal arrives while
>>> we're executing the hooks, the second hook execution will be skipped
>>> due to 'recursion' check, but the signal will be re-raised immediately,
>>> killing the process.
>>
>> As your suggestions, I rewrite recurse as a threading.Lock. So that protect calling hooks
>> applications registered.
>>
>>>
>>> There is likley a way to fix that by using a mutex instead of recursion
>>> check and blocking it while executing the hooks.  The signal number
>>> will need to be stored in the signal handler and the signal will need
>>> to be re-raised in the call_hooks() instead of signal handler.
>>> We can use mutex.acquire(blocking=False) as a way to prevent recursion.
>>>
>>
>> To avoid breaking calling hooks, I set the signal restored in the signal handler to SIG_IGN in 
>> call_hooks.
> 
> I'm not sure this is necessary, see below.
> 
>> And move re-raised the signal in the call_hooks with locked.
>>
>>> Does that make sense?
>>>
>>> Best regards, Ilya Maximets.
>>
>> diff --git a/python/ovs/fatal_signal.py b/python/ovs/fatal_signal.py
>> index cb2e99e87..3aa56a166 100644
>> --- a/python/ovs/fatal_signal.py
>> +++ b/python/ovs/fatal_signal.py
>> @@ -16,6 +16,7 @@ import atexit
>>  import os
>>  import signal
>>  import sys
>> +import threading
>>
>>  import ovs.vlog
>>
>> @@ -112,28 +113,34 @@ def _unlink(file_):
>>  def _signal_handler(signr, _):
>>      _call_hooks(signr)
>>
>> -    # Re-raise the signal with the default handling so that the program
>> -    # termination status reflects that we were killed by this signal.
>> -    signal.signal(signr, signal.SIG_DFL)
>> -    os.kill(os.getpid(), signr)
>> -
>>
>>  def _atexit_handler():
>>      _call_hooks(0)
>>
>>
>> -recurse = False
>> +recurse = threading.Lock()
> 
> It's better to rename this to just 'mutex' or 'lock'.
> 
>>
>>
>>  def _call_hooks(signr):
>>      global recurse
>> -    if recurse:
>> +    if recurse.locked():
>>          return
> 
> There is an awkward race window between checking that the mutex is
> locked and actually attempting to take it below.  It's probably not
> a big deal here, but it's better to take the lock at the same time
> with checking, e.g.:
> 
>     if not mutex.acquire(blocking=False):
>         return
> 
> And since we're going to end the process anyway (it's a handler
> for fatal signals after all), we don't actually need to unlock it,
> AFAICT.
> 
>> -    recurse = True
>>
>> -    for hook, cancel, run_at_exit in _hooks:
>> -        if signr != 0 or run_at_exit:
>> -            hook()
>> +    with recurse:
>> +        if signr != 0:
>> +            signal.signal(signr, signal.SIG_IGN)
>> +        else:
>> +            signal.signal(signal.SIGINT, signal.SIG_IGN)
> 
> If another signal arrives while we're under the lock, the signal
> handler will call the _call_hooks(), check that the lock is already
> taken and return.  There is no need to ignore it.
> Also, if a different signal arrives the code above will not ignore
> it anyway.
> 
>> +
>> +        for hook, cancel, run_at_exit in _hooks:
>> +            if signr != 0 or run_at_exit:
>> +                hook()
>> +
>> +        if signr != 0:
>> +           # Re-raise the signal with the default handling so that the program
>> +           # termination status reflects that we were killed by this signal.
>> +           signal.signal(signr, signal.SIG_DFL)
>> +           os.kill(os.getpid(), signr)
> 
> We may want to place this in the try: finally: block, so we still
> re-raise the signal even if one of the hooks throws an exception.
> Hooks should not generally throw exceptions, but may be safer this
> way.

On a second thought, if we re-raise after the exception in the
hook, we may loose the exception.  So, letting the exception to
kill the application may be better in this case then hiding it
with re-raising a signal.

> 
> What do you think?
> 
>>
>>
>>  _inited = False
>> @@ -165,7 +172,6 @@ def signal_alarm(timeout):
>>
>>      if sys.platform == "win32":
>>          import time
>> -        import threading
>>
>>          class Alarm (threading.Thread):
>>              def __init__(self, timeout):
>> --
>> 2.43.0
>>
>> Thanks Ilya Maximets, please help me review again.
>>
>> Best regards, Daniel Ding.
>>
>
Daniel Ding Feb. 22, 2024, 1:12 p.m. UTC | #6
> 2024年2月22日 下午7:20,Ilya Maximets <i.maximets@ovn.org> 写道:
> 
> On 2/22/24 11:53, Ilya Maximets wrote:
>> On 2/22/24 11:02, Daniel Ding wrote:
>>> 
>>> 
>>>> 2024年2月22日 上午1:55,Ilya Maximets <i.maximets@ovn.org <mailto:i.maximets@ovn.org>> 写道:
>>>> 
>>>> On 2/21/24 15:27, Aaron Conole wrote:
>>>>> Daniel Ding <danieldin186@gmail.com <mailto:danieldin186@gmail.com>> writes:
>>>>> 
>>>>>> After running ovs-tcpdump and inputs multiple CTRL+C, the program will
>>>>>> raise the following exception.
>>>>>> 
>>>>>> Error in atexit._run_exitfuncs:
>>>>>> Traceback (most recent call last):
>>>>>>  File "/usr/bin/ovs-tcpdump", line 421, in cleanup_mirror
>>>>>>    ovsdb = OVSDB(db_sock)
>>>>>>  File "/usr/bin/ovs-tcpdump", line 168, in __init__
>>>>>>    OVSDB.wait_for_db_change(self._idl_conn)  # Initial Sync with DB
>>>>>>  File "/usr/bin/ovs-tcpdump", line 155, in wait_for_db_change
>>>>>>    while idl.change_seqno == seq and not idl.run():
>>>>>> 
>>>>>> Signed-off-by: Daniel Ding <zhihui.ding@easystack.cn <mailto:zhihui.ding@easystack.cn>>
>>>>>> ---
>>>>> 
>>>>> LGTM for the linux side - maybe Alin might check the windows side.
>>>> 
>>>> The code is copied from the fatal-signla library, so it is likely fine.
>>>> However, I don;t think this issue should be fixed on the application
>>>> level (ovs-tcpdump).   There seems to be adifference in how C and python
>>>> fatal-signla libraries behave.  The C version calls the hooks in the run()
>>>> function and the signal handler only updates the signal number variable.
>>>> So, if the same signal arrives multiple times it will be handled only once
>>>> and the run() function will not exit until all the hooks are done, unless
>>>> it is a segfault.
>>>> 
>>>> With python implementation we're calling hooks right from the signal
>>>> handler (it's not a real signal handler, so we're allowed to use not
>>>> really singal-safe functions there).  So, if another signal arrives while
>>>> we're executing the hooks, the second hook execution will be skipped
>>>> due to 'recursion' check, but the signal will be re-raised immediately,
>>>> killing the process.
>>> 
>>> As your suggestions, I rewrite recurse as a threading.Lock. So that protect calling hooks
>>> applications registered.
>>> 
>>>> 
>>>> There is likley a way to fix that by using a mutex instead of recursion
>>>> check and blocking it while executing the hooks.  The signal number
>>>> will need to be stored in the signal handler and the signal will need
>>>> to be re-raised in the call_hooks() instead of signal handler.
>>>> We can use mutex.acquire(blocking=False) as a way to prevent recursion.
>>>> 
>>> 
>>> To avoid breaking calling hooks, I set the signal restored in the signal handler to SIG_IGN in 
>>> call_hooks.
>> 
>> I'm not sure this is necessary, see below.

The default handler of SIGINT is default_int_handler, so it not register the signal handler
successfully. When received CTRL+C again, the program be break, and calling hook can't
execute completely.

>> 
>>> And move re-raised the signal in the call_hooks with locked.
>>> 
>>>> Does that make sense?
>>>> 
>>>> Best regards, Ilya Maximets.
>>> 
>>> diff --git a/python/ovs/fatal_signal.py b/python/ovs/fatal_signal.py
>>> index cb2e99e87..3aa56a166 100644
>>> --- a/python/ovs/fatal_signal.py
>>> +++ b/python/ovs/fatal_signal.py
>>> @@ -16,6 +16,7 @@ import atexit
>>>  import os
>>>  import signal
>>>  import sys
>>> +import threading
>>> 
>>>  import ovs.vlog
>>> 
>>> @@ -112,28 +113,34 @@ def _unlink(file_):
>>>  def _signal_handler(signr, _):
>>>      _call_hooks(signr)
>>> 
>>> -    # Re-raise the signal with the default handling so that the program
>>> -    # termination status reflects that we were killed by this signal.
>>> -    signal.signal(signr, signal.SIG_DFL)
>>> -    os.kill(os.getpid(), signr)
>>> -
>>> 
>>>  def _atexit_handler():
>>>      _call_hooks(0)
>>> 
>>> 
>>> -recurse = False
>>> +recurse = threading.Lock()
>> 
>> It's better to rename this to just 'mutex' or 'lock'.

Done

>> 
>>> 
>>> 
>>>  def _call_hooks(signr):
>>>      global recurse
>>> -    if recurse:
>>> +    if recurse.locked():
>>>          return
>> 
>> There is an awkward race window between checking that the mutex is
>> locked and actually attempting to take it below.  It's probably not
>> a big deal here, but it's better to take the lock at the same time
>> with checking, e.g.:
>> 
>>    if not mutex.acquire(blocking=False):
>>        return
>> 
>> And since we're going to end the process anyway (it's a handler
>> for fatal signals after all), we don't actually need to unlock it,
>> AFAICT.

Agree it.

>> 
>>> -    recurse = True
>>> 
>>> -    for hook, cancel, run_at_exit in _hooks:
>>> -        if signr != 0 or run_at_exit:
>>> -            hook()
>>> +    with recurse:
>>> +        if signr != 0:
>>> +            signal.signal(signr, signal.SIG_IGN)
>>> +        else:
>>> +            signal.signal(signal.SIGINT, signal.SIG_IGN)
>> 
>> If another signal arrives while we're under the lock, the signal
>> handler will call the _call_hooks(), check that the lock is already
>> taken and return.  There is no need to ignore it.
>> Also, if a different signal arrives the code above will not ignore
>> it anyway.
>> 

You’re right, but the SIGINT is exception in the _init because its default handler is not SIG_DFL.
So I append a new condition for SIGINT to register the signal handler.

        for signr in signals:
            handler = signal.getsignal(signr)
            if (handler == signal.SIG_DFL or
                handler == signal.default_int_handler):
                signal.signal(signr, _signal_handler)


>>> +
>>> +        for hook, cancel, run_at_exit in _hooks:
>>> +            if signr != 0 or run_at_exit:
>>> +                hook()
>>> +
>>> +        if signr != 0:
>>> +           # Re-raise the signal with the default handling so that the program
>>> +           # termination status reflects that we were killed by this signal.
>>> +           signal.signal(signr, signal.SIG_DFL)
>>> +           os.kill(os.getpid(), signr)
>> 
>> We may want to place this in the try: finally: block, so we still
>> re-raise the signal even if one of the hooks throws an exception.
>> Hooks should not generally throw exceptions, but may be safer this
>> way.
> 
> On a second thought, if we re-raise after the exception in the
> hook, we may loose the exception.  So, letting the exception to
> kill the application may be better in this case then hiding it
> with re-raising a signal.
> 
>> 
>> What do you think?
>> 
>>> 
>>> 
>>>  _inited = False
>>> @@ -165,7 +172,6 @@ def signal_alarm(timeout):
>>> 
>>>      if sys.platform == "win32":
>>>          import time
>>> -        import threading
>>> 
>>>          class Alarm (threading.Thread):
>>>              def __init__(self, timeout):
>>> --
>>> 2.43.0
>>> 
>>> Thanks Ilya Maximets, please help me review again.
>>> 
>>> Best regards, Daniel Ding.


 python/ovs/fatal_signal.py | 24 +++++++++++++-----------
 1 file changed, 13 insertions(+), 11 deletions(-)

diff --git a/python/ovs/fatal_signal.py b/python/ovs/fatal_signal.py
index cb2e99e87..077f50dd5 100644
--- a/python/ovs/fatal_signal.py
+++ b/python/ovs/fatal_signal.py
@@ -16,6 +16,7 @@ import atexit
 import os
 import signal
 import sys
+import threading

 import ovs.vlog

@@ -112,29 +113,29 @@ def _unlink(file_):
 def _signal_handler(signr, _):
     _call_hooks(signr)

-    # Re-raise the signal with the default handling so that the program
-    # termination status reflects that we were killed by this signal.
-    signal.signal(signr, signal.SIG_DFL)
-    os.kill(os.getpid(), signr)
-

 def _atexit_handler():
     _call_hooks(0)


-recurse = False
+mutex = threading.Lock()


 def _call_hooks(signr):
-    global recurse
-    if recurse:
+    global mutex
+    if not mutex.acquire(blocking=False):
         return
-    recurse = True

     for hook, cancel, run_at_exit in _hooks:
         if signr != 0 or run_at_exit:
             hook()

+    if signr != 0:
+       # Re-raise the signal with the default handling so that the program
+       # termination status reflects that we were killed by this signal.
+       signal.signal(signr, signal.SIG_DFL)
+       os.kill(os.getpid(), signr)
+

 _inited = False

@@ -150,7 +151,9 @@ def _init():
                        signal.SIGALRM]

         for signr in signals:
-            if signal.getsignal(signr) == signal.SIG_DFL:
+            handler = signal.getsignal(signr)
+            if (handler == signal.SIG_DFL or
+                handler == signal.default_int_handler):
                 signal.signal(signr, _signal_handler)
         atexit.register(_atexit_handler)

@@ -165,7 +168,6 @@ def signal_alarm(timeout):

     if sys.platform == "win32":
         import time
-        import threading

         class Alarm (threading.Thread):
             def __init__(self, timeout):



Thanks for your review, Ilya Maximets.


Regards,
Daniel Ding
Ilya Maximets Feb. 23, 2024, 1:36 a.m. UTC | #7
On 2/22/24 14:12, Daniel Ding wrote:
> 
> 
> 
>> 2024年2月22日 下午7:20,Ilya Maximets <i.maximets@ovn.org <mailto:i.maximets@ovn.org>> 写道:
>>
>> On 2/22/24 11:53, Ilya Maximets wrote:
>>> On 2/22/24 11:02, Daniel Ding wrote:
>>>>
>>>>
>>>>> 2024年2月22日 上午1:55,Ilya Maximets <i.maximets@ovn.org <mailto:i.maximets@ovn.org> <mailto:i.maximets@ovn.org <mailto:i.maximets@ovn.org>>> 写道:
>>>>>
>>>>> On 2/21/24 15:27, Aaron Conole wrote:
>>>>>> Daniel Ding <danieldin186@gmail.com <mailto:danieldin186@gmail.com> <mailto:danieldin186@gmail.com <mailto:danieldin186@gmail.com>>> writes:
>>>>>>
>>>>>>> After running ovs-tcpdump and inputs multiple CTRL+C, the program will
>>>>>>> raise the following exception.
>>>>>>>
>>>>>>> Error in atexit._run_exitfuncs:
>>>>>>> Traceback (most recent call last):
>>>>>>>  File "/usr/bin/ovs-tcpdump", line 421, in cleanup_mirror
>>>>>>>    ovsdb = OVSDB(db_sock)
>>>>>>>  File "/usr/bin/ovs-tcpdump", line 168, in __init__
>>>>>>>    OVSDB.wait_for_db_change(self._idl_conn)  # Initial Sync with DB
>>>>>>>  File "/usr/bin/ovs-tcpdump", line 155, in wait_for_db_change
>>>>>>>    while idl.change_seqno == seq and not idl.run():
>>>>>>>
>>>>>>> Signed-off-by: Daniel Ding <zhihui.ding@easystack.cn <mailto:zhihui.ding@easystack.cn> <mailto:zhihui.ding@easystack.cn <mailto:zhihui.ding@easystack.cn>>>
>>>>>>> ---
>>>>>>
>>>>>> LGTM for the linux side - maybe Alin might check the windows side.
>>>>>
>>>>> The code is copied from the fatal-signla library, so it is likely fine.
>>>>> However, I don;t think this issue should be fixed on the application
>>>>> level (ovs-tcpdump).   There seems to be adifference in how C and python
>>>>> fatal-signla libraries behave.  The C version calls the hooks in the run()
>>>>> function and the signal handler only updates the signal number variable.
>>>>> So, if the same signal arrives multiple times it will be handled only once
>>>>> and the run() function will not exit until all the hooks are done, unless
>>>>> it is a segfault.
>>>>>
>>>>> With python implementation we're calling hooks right from the signal
>>>>> handler (it's not a real signal handler, so we're allowed to use not
>>>>> really singal-safe functions there).  So, if another signal arrives while
>>>>> we're executing the hooks, the second hook execution will be skipped
>>>>> due to 'recursion' check, but the signal will be re-raised immediately,
>>>>> killing the process.
>>>>
>>>> As your suggestions, I rewrite recurse as a threading.Lock. So that protect calling hooks
>>>> applications registered.
>>>>
>>>>>
>>>>> There is likley a way to fix that by using a mutex instead of recursion
>>>>> check and blocking it while executing the hooks.  The signal number
>>>>> will need to be stored in the signal handler and the signal will need
>>>>> to be re-raised in the call_hooks() instead of signal handler.
>>>>> We can use mutex.acquire(blocking=False) as a way to prevent recursion.
>>>>>
>>>>
>>>> To avoid breaking calling hooks, I set the signal restored in the signal handler to SIG_IGN in 
>>>> call_hooks.
>>>
>>> I'm not sure this is necessary, see below.
> 
> The default handler of SIGINT is default_int_handler, so it not register the signal handler
> successfully. When received CTRL+C again, the program be break, and calling hook can't
> execute completely.
> 
>>>
>>>> And move re-raised the signal in the call_hooks with locked.
>>>>
>>>>> Does that make sense?
>>>>>
>>>>> Best regards, Ilya Maximets.
>>>>
>>>> diff --git a/python/ovs/fatal_signal.py b/python/ovs/fatal_signal.py
>>>> index cb2e99e87..3aa56a166 100644
>>>> --- a/python/ovs/fatal_signal.py
>>>> +++ b/python/ovs/fatal_signal.py
>>>> @@ -16,6 +16,7 @@ import atexit
>>>>  import os
>>>>  import signal
>>>>  import sys
>>>> +import threading
>>>>
>>>>  import ovs.vlog
>>>>
>>>> @@ -112,28 +113,34 @@ def _unlink(file_):
>>>>  def _signal_handler(signr, _):
>>>>      _call_hooks(signr)
>>>>
>>>> -    # Re-raise the signal with the default handling so that the program
>>>> -    # termination status reflects that we were killed by this signal.
>>>> -    signal.signal(signr, signal.SIG_DFL)
>>>> -    os.kill(os.getpid(), signr)
>>>> -
>>>>
>>>>  def _atexit_handler():
>>>>      _call_hooks(0)
>>>>
>>>>
>>>> -recurse = False
>>>> +recurse = threading.Lock()
>>>
>>> It's better to rename this to just 'mutex' or 'lock'.
> 
> Done
> 
>>>
>>>>
>>>>
>>>>  def _call_hooks(signr):
>>>>      global recurse
>>>> -    if recurse:
>>>> +    if recurse.locked():
>>>>          return
>>>
>>> There is an awkward race window between checking that the mutex is
>>> locked and actually attempting to take it below.  It's probably not
>>> a big deal here, but it's better to take the lock at the same time
>>> with checking, e.g.:
>>>
>>>    if not mutex.acquire(blocking=False):
>>>        return
>>>
>>> And since we're going to end the process anyway (it's a handler
>>> for fatal signals after all), we don't actually need to unlock it,
>>> AFAICT.
> 
> Agree it.
> 
>>>
>>>> -    recurse = True
>>>>
>>>> -    for hook, cancel, run_at_exit in _hooks:
>>>> -        if signr != 0 or run_at_exit:
>>>> -            hook()
>>>> +    with recurse:
>>>> +        if signr != 0:
>>>> +            signal.signal(signr, signal.SIG_IGN)
>>>> +        else:
>>>> +            signal.signal(signal.SIGINT, signal.SIG_IGN)
>>>
>>> If another signal arrives while we're under the lock, the signal
>>> handler will call the _call_hooks(), check that the lock is already
>>> taken and return.  There is no need to ignore it.
>>> Also, if a different signal arrives the code above will not ignore
>>> it anyway.
>>>
> 
> You’re right, but the SIGINT is exception in the _init because its default handler is not SIG_DFL.
> So I append a new condition for SIGINT to register the signal handler.
> 
>         for signr in signals:
>             handler = signal.getsignal(signr)
>             if (handler == signal.SIG_DFL or
>                 handler == signal.default_int_handler):
>                 signal.signal(signr, _signal_handler)

Hmm, yeah, makes sense.

With that, do we need to catch the KeyboardInterrupt exception
in the ovs-tcpdump now?

> 
> 
>>>> +
>>>> +        for hook, cancel, run_at_exit in _hooks:
>>>> +            if signr != 0 or run_at_exit:
>>>> +                hook()
>>>> +
>>>> +        if signr != 0:
>>>> +           # Re-raise the signal with the default handling so that the program
>>>> +           # termination status reflects that we were killed by this signal.
>>>> +           signal.signal(signr, signal.SIG_DFL)
>>>> +           os.kill(os.getpid(), signr)
>>>
>>> We may want to place this in the try: finally: block, so we still
>>> re-raise the signal even if one of the hooks throws an exception.
>>> Hooks should not generally throw exceptions, but may be safer this
>>> way.
>>
>> On a second thought, if we re-raise after the exception in the
>> hook, we may loose the exception.  So, letting the exception to
>> kill the application may be better in this case then hiding it
>> with re-raising a signal.
>>
>>>
>>> What do you think?
>>>
>>>>
>>>>
>>>>  _inited = False
>>>> @@ -165,7 +172,6 @@ def signal_alarm(timeout):
>>>>
>>>>      if sys.platform == "win32":
>>>>          import time
>>>> -        import threading
>>>>
>>>>          class Alarm (threading.Thread):
>>>>              def __init__(self, timeout):
>>>> --
>>>> 2.43.0
>>>>
>>>> Thanks Ilya Maximets, please help me review again.
>>>>
>>>> Best regards, Daniel Ding.
> 
> 
>  python/ovs/fatal_signal.py | 24 +++++++++++++-----------
>  1 file changed, 13 insertions(+), 11 deletions(-)
> 
> diff --git a/python/ovs/fatal_signal.py b/python/ovs/fatal_signal.py
> index cb2e99e87..077f50dd5 100644
> --- a/python/ovs/fatal_signal.py
> +++ b/python/ovs/fatal_signal.py
> @@ -16,6 +16,7 @@ import atexit
>  import os
>  import signal
>  import sys
> +import threading
> 
>  import ovs.vlog
> 
> @@ -112,29 +113,29 @@ def _unlink(file_):
>  def _signal_handler(signr, _):
>      _call_hooks(signr)
> 
> -    # Re-raise the signal with the default handling so that the program
> -    # termination status reflects that we were killed by this signal.
> -    signal.signal(signr, signal.SIG_DFL)
> -    os.kill(os.getpid(), signr)
> -
> 
>  def _atexit_handler():
>      _call_hooks(0)
> 
> 
> -recurse = False
> +mutex = threading.Lock()
> 
> 
>  def _call_hooks(signr):
> -    global recurse
> -    if recurse:
> +    global mutex
> +    if not mutex.acquire(blocking=False):
>          return
> -    recurse = True
> 
>      for hook, cancel, run_at_exit in _hooks:
>          if signr != 0 or run_at_exit:
>              hook()
> 
> +    if signr != 0:
> +       # Re-raise the signal with the default handling so that the program
> +       # termination status reflects that we were killed by this signal.
> +       signal.signal(signr, signal.SIG_DFL)
> +       os.kill(os.getpid(), signr)
> +
> 
>  _inited = False
> 
> @@ -150,7 +151,9 @@ def _init():
>                         signal.SIGALRM]
> 
>          for signr in signals:
> -            if signal.getsignal(signr) == signal.SIG_DFL:
> +            handler = signal.getsignal(signr)
> +            if (handler == signal.SIG_DFL or
> +                handler == signal.default_int_handler):
>                  signal.signal(signr, _signal_handler)
>          atexit.register(_atexit_handler)
> 
> @@ -165,7 +168,6 @@ def signal_alarm(timeout):
> 
>      if sys.platform == "win32":
>          import time
> -        import threading
> 
>          class Alarm (threading.Thread):
>              def __init__(self, timeout):
> 
> 
> 
> Thanks for your review, Ilya Maximets.

I left a comment above that may need to be addressed.
But otherwise, this version seems mostly fine.  

Could you please send it as a new PATCH v2 with adjusted
patch name and the commit message?  It will get tested and
it will be easier to continue reviews this way.

Thanks!

Best regards, Ilya Maximets.
diff mbox series

Patch

diff --git a/utilities/ovs-tcpdump.in b/utilities/ovs-tcpdump.in
index 4cbd9a5d3..eec2262d6 100755
--- a/utilities/ovs-tcpdump.in
+++ b/utilities/ovs-tcpdump.in
@@ -17,6 +17,7 @@ 
 import os
 import pwd
 from random import randint
+import signal
 import subprocess
 import sys
 import time
@@ -417,8 +418,22 @@  def py_which(executable):
                for path in os.environ["PATH"].split(os.pathsep))
 
 
+def ignore_fatal_signals():
+    if sys.platform == "win32":
+        signals = [signal.SIGTERM, signal.SIGINT]
+    else:
+        signals = [signal.SIGTERM, signal.SIGINT, signal.SIGHUP,
+                   signal.SIGALRM]
+
+    for signr in signals:
+        signal.signal(signr, signal.SIG_IGN)
+
+
 def teardown(db_sock, interface, mirror_interface, tap_created):
     def cleanup_mirror():
+        # Ignore fatal signals, avoid it to break OVSDB operations.
+        # So that cleanup mirror and ports be finished.
+        ignore_fatal_signals()
         try:
             ovsdb = OVSDB(db_sock)
             ovsdb.destroy_mirror(interface, ovsdb.port_bridge(interface))