diff mbox series

[ovs-dev] vlog: Log stack trace on vlog_abort.

Message ID 20240405220857.2360287-1-i.maximets@ovn.org
State Accepted
Commit a75e1c37aae62a4559a52907ad9962cfcba77253
Delegated to: Ilya Maximets
Headers show
Series [ovs-dev] vlog: Log stack trace on vlog_abort. | expand

Checks

Context Check Description
ovsrobot/apply-robot success apply and check: success
ovsrobot/github-robot-_Build_and_Test success github build: passed
ovsrobot/intel-ovs-compilation success test: success

Commit Message

Ilya Maximets April 5, 2024, 10:08 p.m. UTC
Currently, calls like ovs_assert() just print out a condition that
caused assertion to fail.  But it may be not enough to understand what
exactly has happened, especially if assertion failed in some generic
function like dp_packet_resize() or similar.

Print the stack trace along with the abort message to give more context
for the later debugging.

This should be especially useful in case the issue happens in an
environment with core dumps disabled.

Adding the log to vlog_abort() to cover a little more cases where
VLOG_ABORT is called and not only assertion failures.

It would be nice to also have stack traces in case of reaching the
OVS_NOT_REACHED().  But this macro is used in some places as a last
resort and should not actually do more than just stopping the process
immediately.  And it also can be used in contexts without logging
initialized.  Such a change will need to be done more carefully.
Better solution might be to use VLOG_ABORT() where appropriate instead.

Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
---
 lib/vlog.c       | 10 ++++++++--
 tests/library.at |  4 +++-
 2 files changed, 11 insertions(+), 3 deletions(-)

Comments

Ilya Maximets April 10, 2024, 12:10 p.m. UTC | #1
On 4/6/24 00:08, Ilya Maximets wrote:
> Currently, calls like ovs_assert() just print out a condition that
> caused assertion to fail.  But it may be not enough to understand what
> exactly has happened, especially if assertion failed in some generic
> function like dp_packet_resize() or similar.
> 
> Print the stack trace along with the abort message to give more context
> for the later debugging.
> 
> This should be especially useful in case the issue happens in an
> environment with core dumps disabled.
> 
> Adding the log to vlog_abort() to cover a little more cases where
> VLOG_ABORT is called and not only assertion failures.
> 
> It would be nice to also have stack traces in case of reaching the
> OVS_NOT_REACHED().  But this macro is used in some places as a last
> resort and should not actually do more than just stopping the process
> immediately.  And it also can be used in contexts without logging
> initialized.  Such a change will need to be done more carefully.
> Better solution might be to use VLOG_ABORT() where appropriate instead.
> 
> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
> ---
>  lib/vlog.c       | 10 ++++++++--
>  tests/library.at |  4 +++-
>  2 files changed, 11 insertions(+), 3 deletions(-)

If this change is accepted, I'd also suggest backporting it to 3.3.
It is a debug-only change that touches only the code executed under
fatal failure conditions, so should be safe enough.  Backporting
will allow us easier debugging in to-be-LTS release.  And also OVN
24.03 LTS can make use of it this way as well.

Best regards, Ilya Maximets.
Kevin Traynor April 10, 2024, 3:01 p.m. UTC | #2
On 05/04/2024 23:08, Ilya Maximets wrote:
> Currently, calls like ovs_assert() just print out a condition that
> caused assertion to fail.  But it may be not enough to understand what
> exactly has happened, especially if assertion failed in some generic
> function like dp_packet_resize() or similar.
> 
> Print the stack trace along with the abort message to give more context
> for the later debugging.
> 
> This should be especially useful in case the issue happens in an
> environment with core dumps disabled.
> 
> Adding the log to vlog_abort() to cover a little more cases where
> VLOG_ABORT is called and not only assertion failures.
> 
> It would be nice to also have stack traces in case of reaching the
> OVS_NOT_REACHED().  But this macro is used in some places as a last
> resort and should not actually do more than just stopping the process
> immediately.  And it also can be used in contexts without logging
> initialized.  Such a change will need to be done more carefully.
> Better solution might be to use VLOG_ABORT() where appropriate instead.
> 
Thanks Ilya. Tried it and it's working. One comment below.

> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
> ---
>  lib/vlog.c       | 10 ++++++++--
>  tests/library.at |  4 +++-
>  2 files changed, 11 insertions(+), 3 deletions(-)
> 
> diff --git a/lib/vlog.c b/lib/vlog.c
> index b2653142f..e78c785f7 100644
> --- a/lib/vlog.c
> +++ b/lib/vlog.c
> @@ -29,6 +29,7 @@
>  #include <time.h>
>  #include <unistd.h>
>  #include "async-append.h"
> +#include "backtrace.h"
>  #include "coverage.h"
>  #include "dirs.h"
>  #include "openvswitch/dynamic-string.h"
> @@ -1274,8 +1275,9 @@ vlog_fatal(const struct vlog_module *module, const char *message, ...)
>      va_end(args);
>  }
>  
> -/* Logs 'message' to 'module' at maximum verbosity, then calls abort().  Always
> - * writes the message to stderr, even if the console destination is disabled.
> +/* Attempts to log a stack trace, logs 'message' to 'module' at maximum
> + * verbosity, then calls abort().  Always writes the message to stderr, even
> + * if the console destination is disabled.
>   *
>   * Choose this function instead of vlog_fatal_valist() if the daemon monitoring
>   * facility should automatically restart the current daemon.  */
> @@ -1289,6 +1291,10 @@ vlog_abort_valist(const struct vlog_module *module_,
>       * message written by the later ovs_abort_valist(). */
>      module->levels[VLF_CONSOLE] = VLL_OFF;
>  
> +    /* Printing the stack trace before the 'message', because the 'message'
> +     * will flush the async log queue (VLL_EMER).  With a different order we
> +     * would need to flush the queue manually again. */
> +    log_backtrace();
>      vlog_valist(module, VLL_EMER, message, args);
>      ovs_abort_valist(0, message, args);
>  }

Is it worth adding to vlog_fatal_valist() as well for VLOG_FATAL()?

If there's some reason not to, then LGTM as is.

Acked-by: Kevin Traynor <ktraynor@redhat.com>

> diff --git a/tests/library.at b/tests/library.at
> index 7b4acebb8..d962e1b3f 100644
> --- a/tests/library.at
> +++ b/tests/library.at
> @@ -230,7 +230,9 @@ AT_CHECK([ovstest test-util -voff -vfile:info '-vPATTERN:file:%c|%p|%m' --log-fi
>    [$exit_status], [], [stderr])
>  
>  AT_CHECK([sed 's/\(opened log file\) .*/\1/
> -s/|[[^|]]*: /|/' test-util.log], [0], [dnl
> +s/|[[^|]]*: /|/
> +/backtrace/d
> +/|.*|/!d' test-util.log], [0], [dnl
>  vlog|INFO|opened log file
>  util|EMER|assertion false failed in test_assert()
>  ])
Ilya Maximets April 10, 2024, 3:16 p.m. UTC | #3
On 4/10/24 17:01, Kevin Traynor wrote:
> On 05/04/2024 23:08, Ilya Maximets wrote:
>> Currently, calls like ovs_assert() just print out a condition that
>> caused assertion to fail.  But it may be not enough to understand what
>> exactly has happened, especially if assertion failed in some generic
>> function like dp_packet_resize() or similar.
>>
>> Print the stack trace along with the abort message to give more context
>> for the later debugging.
>>
>> This should be especially useful in case the issue happens in an
>> environment with core dumps disabled.
>>
>> Adding the log to vlog_abort() to cover a little more cases where
>> VLOG_ABORT is called and not only assertion failures.
>>
>> It would be nice to also have stack traces in case of reaching the
>> OVS_NOT_REACHED().  But this macro is used in some places as a last
>> resort and should not actually do more than just stopping the process
>> immediately.  And it also can be used in contexts without logging
>> initialized.  Such a change will need to be done more carefully.
>> Better solution might be to use VLOG_ABORT() where appropriate instead.
>>
> Thanks Ilya. Tried it and it's working. One comment below.
> 
>> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
>> ---
>>  lib/vlog.c       | 10 ++++++++--
>>  tests/library.at |  4 +++-
>>  2 files changed, 11 insertions(+), 3 deletions(-)
>>
>> diff --git a/lib/vlog.c b/lib/vlog.c
>> index b2653142f..e78c785f7 100644
>> --- a/lib/vlog.c
>> +++ b/lib/vlog.c
>> @@ -29,6 +29,7 @@
>>  #include <time.h>
>>  #include <unistd.h>
>>  #include "async-append.h"
>> +#include "backtrace.h"
>>  #include "coverage.h"
>>  #include "dirs.h"
>>  #include "openvswitch/dynamic-string.h"
>> @@ -1274,8 +1275,9 @@ vlog_fatal(const struct vlog_module *module, const char *message, ...)
>>      va_end(args);
>>  }
>>  
>> -/* Logs 'message' to 'module' at maximum verbosity, then calls abort().  Always
>> - * writes the message to stderr, even if the console destination is disabled.
>> +/* Attempts to log a stack trace, logs 'message' to 'module' at maximum
>> + * verbosity, then calls abort().  Always writes the message to stderr, even
>> + * if the console destination is disabled.
>>   *
>>   * Choose this function instead of vlog_fatal_valist() if the daemon monitoring
>>   * facility should automatically restart the current daemon.  */
>> @@ -1289,6 +1291,10 @@ vlog_abort_valist(const struct vlog_module *module_,
>>       * message written by the later ovs_abort_valist(). */
>>      module->levels[VLF_CONSOLE] = VLL_OFF;
>>  
>> +    /* Printing the stack trace before the 'message', because the 'message'
>> +     * will flush the async log queue (VLL_EMER).  With a different order we
>> +     * would need to flush the queue manually again. */
>> +    log_backtrace();
>>      vlog_valist(module, VLL_EMER, message, args);
>>      ovs_abort_valist(0, message, args);
>>  }
> 
> Is it worth adding to vlog_fatal_valist() as well for VLOG_FATAL()?
> 
> If there's some reason not to, then LGTM as is.

VLOG_FATAL is used more widely for user-errors or environment issues.
It's not appropriate to print stack traces in such scenarios.
For example, I don't think we want to dump the trace when users provide
incorrect command line arguments in tools (VLOG_FATAL is used for that).

Abort, OTOH, usually signifies a programming error and should not actually
be invoked under normal circumstances, so it should be fine to dump the
stack there.

> 
> Acked-by: Kevin Traynor <ktraynor@redhat.com>
> 
>> diff --git a/tests/library.at b/tests/library.at
>> index 7b4acebb8..d962e1b3f 100644
>> --- a/tests/library.at
>> +++ b/tests/library.at
>> @@ -230,7 +230,9 @@ AT_CHECK([ovstest test-util -voff -vfile:info '-vPATTERN:file:%c|%p|%m' --log-fi
>>    [$exit_status], [], [stderr])
>>  
>>  AT_CHECK([sed 's/\(opened log file\) .*/\1/
>> -s/|[[^|]]*: /|/' test-util.log], [0], [dnl
>> +s/|[[^|]]*: /|/
>> +/backtrace/d
>> +/|.*|/!d' test-util.log], [0], [dnl
>>  vlog|INFO|opened log file
>>  util|EMER|assertion false failed in test_assert()
>>  ])
>
Kevin Traynor April 10, 2024, 3:24 p.m. UTC | #4
On 10/04/2024 16:16, Ilya Maximets wrote:
> On 4/10/24 17:01, Kevin Traynor wrote:
>> On 05/04/2024 23:08, Ilya Maximets wrote:
>>> Currently, calls like ovs_assert() just print out a condition that
>>> caused assertion to fail.  But it may be not enough to understand what
>>> exactly has happened, especially if assertion failed in some generic
>>> function like dp_packet_resize() or similar.
>>>
>>> Print the stack trace along with the abort message to give more context
>>> for the later debugging.
>>>
>>> This should be especially useful in case the issue happens in an
>>> environment with core dumps disabled.
>>>
>>> Adding the log to vlog_abort() to cover a little more cases where
>>> VLOG_ABORT is called and not only assertion failures.
>>>
>>> It would be nice to also have stack traces in case of reaching the
>>> OVS_NOT_REACHED().  But this macro is used in some places as a last
>>> resort and should not actually do more than just stopping the process
>>> immediately.  And it also can be used in contexts without logging
>>> initialized.  Such a change will need to be done more carefully.
>>> Better solution might be to use VLOG_ABORT() where appropriate instead.
>>>
>> Thanks Ilya. Tried it and it's working. One comment below.
>>
>>> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
>>> ---
>>>  lib/vlog.c       | 10 ++++++++--
>>>  tests/library.at |  4 +++-
>>>  2 files changed, 11 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/lib/vlog.c b/lib/vlog.c
>>> index b2653142f..e78c785f7 100644
>>> --- a/lib/vlog.c
>>> +++ b/lib/vlog.c
>>> @@ -29,6 +29,7 @@
>>>  #include <time.h>
>>>  #include <unistd.h>
>>>  #include "async-append.h"
>>> +#include "backtrace.h"
>>>  #include "coverage.h"
>>>  #include "dirs.h"
>>>  #include "openvswitch/dynamic-string.h"
>>> @@ -1274,8 +1275,9 @@ vlog_fatal(const struct vlog_module *module, const char *message, ...)
>>>      va_end(args);
>>>  }
>>>  
>>> -/* Logs 'message' to 'module' at maximum verbosity, then calls abort().  Always
>>> - * writes the message to stderr, even if the console destination is disabled.
>>> +/* Attempts to log a stack trace, logs 'message' to 'module' at maximum
>>> + * verbosity, then calls abort().  Always writes the message to stderr, even
>>> + * if the console destination is disabled.
>>>   *
>>>   * Choose this function instead of vlog_fatal_valist() if the daemon monitoring
>>>   * facility should automatically restart the current daemon.  */
>>> @@ -1289,6 +1291,10 @@ vlog_abort_valist(const struct vlog_module *module_,
>>>       * message written by the later ovs_abort_valist(). */
>>>      module->levels[VLF_CONSOLE] = VLL_OFF;
>>>  
>>> +    /* Printing the stack trace before the 'message', because the 'message'
>>> +     * will flush the async log queue (VLL_EMER).  With a different order we
>>> +     * would need to flush the queue manually again. */
>>> +    log_backtrace();
>>>      vlog_valist(module, VLL_EMER, message, args);
>>>      ovs_abort_valist(0, message, args);
>>>  }
>>
>> Is it worth adding to vlog_fatal_valist() as well for VLOG_FATAL()?
>>
>> If there's some reason not to, then LGTM as is.
> 
> VLOG_FATAL is used more widely for user-errors or environment issues.
> It's not appropriate to print stack traces in such scenarios.
> For example, I don't think we want to dump the trace when users provide
> incorrect command line arguments in tools (VLOG_FATAL is used for that).
> 
> Abort, OTOH, usually signifies a programming error and should not actually
> be invoked under normal circumstances, so it should be fine to dump the
> stack there.
> 

ok, that sounds reasonable. Thanks.

>>
>> Acked-by: Kevin Traynor <ktraynor@redhat.com>
>>
>>> diff --git a/tests/library.at b/tests/library.at
>>> index 7b4acebb8..d962e1b3f 100644
>>> --- a/tests/library.at
>>> +++ b/tests/library.at
>>> @@ -230,7 +230,9 @@ AT_CHECK([ovstest test-util -voff -vfile:info '-vPATTERN:file:%c|%p|%m' --log-fi
>>>    [$exit_status], [], [stderr])
>>>  
>>>  AT_CHECK([sed 's/\(opened log file\) .*/\1/
>>> -s/|[[^|]]*: /|/' test-util.log], [0], [dnl
>>> +s/|[[^|]]*: /|/
>>> +/backtrace/d
>>> +/|.*|/!d' test-util.log], [0], [dnl
>>>  vlog|INFO|opened log file
>>>  util|EMER|assertion false failed in test_assert()
>>>  ])
>>
>
Simon Horman April 10, 2024, 4:24 p.m. UTC | #5
On Wed, Apr 10, 2024 at 02:10:20PM +0200, Ilya Maximets wrote:
> On 4/6/24 00:08, Ilya Maximets wrote:
> > Currently, calls like ovs_assert() just print out a condition that
> > caused assertion to fail.  But it may be not enough to understand what
> > exactly has happened, especially if assertion failed in some generic
> > function like dp_packet_resize() or similar.
> > 
> > Print the stack trace along with the abort message to give more context
> > for the later debugging.
> > 
> > This should be especially useful in case the issue happens in an
> > environment with core dumps disabled.
> > 
> > Adding the log to vlog_abort() to cover a little more cases where
> > VLOG_ABORT is called and not only assertion failures.
> > 
> > It would be nice to also have stack traces in case of reaching the
> > OVS_NOT_REACHED().  But this macro is used in some places as a last
> > resort and should not actually do more than just stopping the process
> > immediately.  And it also can be used in contexts without logging
> > initialized.  Such a change will need to be done more carefully.
> > Better solution might be to use VLOG_ABORT() where appropriate instead.
> > 
> > Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
> > ---
> >  lib/vlog.c       | 10 ++++++++--
> >  tests/library.at |  4 +++-
> >  2 files changed, 11 insertions(+), 3 deletions(-)
> 
> If this change is accepted, I'd also suggest backporting it to 3.3.
> It is a debug-only change that touches only the code executed under
> fatal failure conditions, so should be safe enough.  Backporting
> will allow us easier debugging in to-be-LTS release.  And also OVN
> 24.03 LTS can make use of it this way as well.

Acked-by: Simon Horman <horms@ovn.org>
Ilya Maximets April 10, 2024, 8:27 p.m. UTC | #6
On 4/10/24 18:24, Simon Horman wrote:
> On Wed, Apr 10, 2024 at 02:10:20PM +0200, Ilya Maximets wrote:
>> On 4/6/24 00:08, Ilya Maximets wrote:
>>> Currently, calls like ovs_assert() just print out a condition that
>>> caused assertion to fail.  But it may be not enough to understand what
>>> exactly has happened, especially if assertion failed in some generic
>>> function like dp_packet_resize() or similar.
>>>
>>> Print the stack trace along with the abort message to give more context
>>> for the later debugging.
>>>
>>> This should be especially useful in case the issue happens in an
>>> environment with core dumps disabled.
>>>
>>> Adding the log to vlog_abort() to cover a little more cases where
>>> VLOG_ABORT is called and not only assertion failures.
>>>
>>> It would be nice to also have stack traces in case of reaching the
>>> OVS_NOT_REACHED().  But this macro is used in some places as a last
>>> resort and should not actually do more than just stopping the process
>>> immediately.  And it also can be used in contexts without logging
>>> initialized.  Such a change will need to be done more carefully.
>>> Better solution might be to use VLOG_ABORT() where appropriate instead.
>>>
>>> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
>>> ---
>>>  lib/vlog.c       | 10 ++++++++--
>>>  tests/library.at |  4 +++-
>>>  2 files changed, 11 insertions(+), 3 deletions(-)
>>
>> If this change is accepted, I'd also suggest backporting it to 3.3.
>> It is a debug-only change that touches only the code executed under
>> fatal failure conditions, so should be safe enough.  Backporting
>> will allow us easier debugging in to-be-LTS release.  And also OVN
>> 24.03 LTS can make use of it this way as well.
> 
> Acked-by: Simon Horman <horms@ovn.org>
> 

Thanks, Simon and Kevin!

Applied to main and branch-3.3.

Best regards, Ilya Maximets.
diff mbox series

Patch

diff --git a/lib/vlog.c b/lib/vlog.c
index b2653142f..e78c785f7 100644
--- a/lib/vlog.c
+++ b/lib/vlog.c
@@ -29,6 +29,7 @@ 
 #include <time.h>
 #include <unistd.h>
 #include "async-append.h"
+#include "backtrace.h"
 #include "coverage.h"
 #include "dirs.h"
 #include "openvswitch/dynamic-string.h"
@@ -1274,8 +1275,9 @@  vlog_fatal(const struct vlog_module *module, const char *message, ...)
     va_end(args);
 }
 
-/* Logs 'message' to 'module' at maximum verbosity, then calls abort().  Always
- * writes the message to stderr, even if the console destination is disabled.
+/* Attempts to log a stack trace, logs 'message' to 'module' at maximum
+ * verbosity, then calls abort().  Always writes the message to stderr, even
+ * if the console destination is disabled.
  *
  * Choose this function instead of vlog_fatal_valist() if the daemon monitoring
  * facility should automatically restart the current daemon.  */
@@ -1289,6 +1291,10 @@  vlog_abort_valist(const struct vlog_module *module_,
      * message written by the later ovs_abort_valist(). */
     module->levels[VLF_CONSOLE] = VLL_OFF;
 
+    /* Printing the stack trace before the 'message', because the 'message'
+     * will flush the async log queue (VLL_EMER).  With a different order we
+     * would need to flush the queue manually again. */
+    log_backtrace();
     vlog_valist(module, VLL_EMER, message, args);
     ovs_abort_valist(0, message, args);
 }
diff --git a/tests/library.at b/tests/library.at
index 7b4acebb8..d962e1b3f 100644
--- a/tests/library.at
+++ b/tests/library.at
@@ -230,7 +230,9 @@  AT_CHECK([ovstest test-util -voff -vfile:info '-vPATTERN:file:%c|%p|%m' --log-fi
   [$exit_status], [], [stderr])
 
 AT_CHECK([sed 's/\(opened log file\) .*/\1/
-s/|[[^|]]*: /|/' test-util.log], [0], [dnl
+s/|[[^|]]*: /|/
+/backtrace/d
+/|.*|/!d' test-util.log], [0], [dnl
 vlog|INFO|opened log file
 util|EMER|assertion false failed in test_assert()
 ])