
[ovs-dev,v1] ovs-ctl: Allow inclusion of hugepages in coredumps

Message ID: 20220325221712.567255-1-mkp@redhat.com
State: Changes Requested
Series: [ovs-dev,v1] ovs-ctl: Allow inclusion of hugepages in coredumps

Checks

Context Check Description
ovsrobot/apply-robot success apply and check: success
ovsrobot/github-robot-_Build_and_Test success github build: passed
ovsrobot/intel-ovs-compilation success test: success

Commit Message

Mike Pattrick March 25, 2022, 10:17 p.m. UTC
Add a new --dump-hugepages option to ovs-ctl that enables the inclusion
of hugepages in the core dump filter.

Signed-off-by: Mike Pattrick <mkp@redhat.com>
---
 NEWS                 |  4 ++++
 utilities/ovs-ctl.in | 15 +++++++++++----
 2 files changed, 15 insertions(+), 4 deletions(-)

Comments

Adrian Moreno Nov. 25, 2022, 5:19 p.m. UTC | #1
Hi Mike,

Sorry it took so long to review this patch.

On 3/25/22 23:17, Mike Pattrick wrote:
> Add new option --dump-hugepages option in ovs-ctl to enable the addition
> of hugepages in the core dump filter.
> 
> Signed-off-by: Mike Pattrick <mkp@redhat.com>
> ---
>   NEWS                 |  4 ++++
>   utilities/ovs-ctl.in | 15 +++++++++++----
>   2 files changed, 15 insertions(+), 4 deletions(-)
> 
> diff --git a/NEWS b/NEWS
> index 8fa57836a..7af60dce3 100644
> --- a/NEWS
> +++ b/NEWS
> @@ -3,6 +3,10 @@ Post-v2.17.0
>      - OVSDB:
>        * 'relay' service model now supports transaction history, i.e. honors the
>          'last-txn-id' field in 'monitor_cond_since' requests from clients.
> +   - ovs-ctl:
> +     * New option '--dump-hugepages' to include hugepages in core dumps. This
> +       can assist with postmortem analysis involving DPDK, but may also produce
> +       significantly larger core dump files.
>   

I'm afraid this part needs rebasing.

>   
>   v2.17.0 - 17 Feb 2022
> diff --git a/utilities/ovs-ctl.in b/utilities/ovs-ctl.in
> index e6e07f476..8f900314b 100644
> --- a/utilities/ovs-ctl.in
> +++ b/utilities/ovs-ctl.in
> @@ -103,8 +103,13 @@ set_system_ids () {
>       action "Configuring Open vSwitch system IDs" "$@" $extra_ids
>   }
>   
> -check_force_cores () {
> -    if test X"$FORCE_COREFILES" = Xyes; then
> +check_core_config () {
> +    if test X"$DUMP_HUGEPAGES" = Xyes; then
> +        echo 0x3f > /proc/self/coredump_filter
> +        if test X"$FORCE_COREFILES" = Xyes; then
> +            ulimit -c unlimited
> +        fi
> +    elif test X"$FORCE_COREFILES" = Xyes; then
>           ulimit -c 67108864
>       fi
>   }
> @@ -116,7 +121,7 @@ del_transient_ports () {
>   }
>   
>   do_start_ovsdb () {
> -    check_force_cores
> +    check_core_config
>   
>       if daemon_is_running ovsdb-server; then
>           log_success_msg "ovsdb-server is already running"
> @@ -193,7 +198,7 @@ add_managers () {
>   }
>   
>   do_start_forwarding () {
> -    check_force_cores
> +    check_core_config
>   
>       insert_mod_if_required || return 1
>   
> @@ -330,6 +335,7 @@ set_defaults () {
>   
>       DAEMON_CWD=/
>       FORCE_COREFILES=yes
> +    DUMP_HUGEPAGES=no
>       MLOCKALL=yes
>       SELF_CONFINEMENT=yes
>       MONITOR=yes
> @@ -419,6 +425,7 @@ Other important options for "start", "restart" and "force-reload-kmod":
>   Less important options for "start", "restart" and "force-reload-kmod":
>     --daemon-cwd=DIR               set working dir for OVS daemons (default: $DAEMON_CWD)
>     --no-force-corefiles           do not force on core dumps for OVS daemons
> +  --dump-hugepages               include hugepages in coredumps
>     --no-mlockall                  do not lock all of ovs-vswitchd into memory
>     --ovsdb-server-priority=NICE   set ovsdb-server's niceness (default: $OVSDB_SERVER_PRIORITY)
>     --ovsdb-server-options=OPTIONS additional options for ovsdb-server (example: '-vconsole:dbg -vfile:dbg')
> 

Tested locally and verified that, with the option enabled, hugepages appear in coredumps.
Apart from the need to rebase the NEWS entry, the patch looks good to me.

Acked-by: Adrian Moreno <amorenoz@redhat.com>

--
Adrián Moreno
Ilya Maximets Nov. 30, 2022, 12:27 p.m. UTC | #2
On 11/25/22 18:19, Adrian Moreno wrote:
> Hi Mike,
> 
> Sorry it took that long to review this patch.
> 
> On 3/25/22 23:17, Mike Pattrick wrote:
>> Add new option --dump-hugepages option in ovs-ctl to enable the addition
>> of hugepages in the core dump filter.
>>
>> Signed-off-by: Mike Pattrick <mkp@redhat.com>
>> ---
>>   NEWS                 |  4 ++++
>>   utilities/ovs-ctl.in | 15 +++++++++++----
>>   2 files changed, 15 insertions(+), 4 deletions(-)
>>
>> diff --git a/NEWS b/NEWS
>> index 8fa57836a..7af60dce3 100644
>> --- a/NEWS
>> +++ b/NEWS
>> @@ -3,6 +3,10 @@ Post-v2.17.0
>>      - OVSDB:
>>        * 'relay' service model now supports transaction history, i.e. honors the
>>          'last-txn-id' field in 'monitor_cond_since' requests from clients.
>> +   - ovs-ctl:
>> +     * New option '--dump-hugepages' to include hugepages in core dumps. This
>> +       can assist with postmortem analysis involving DPDK, but may also produce
>> +       significantly larger core dump files.
>>   
> 
> I'm afraid this part needs rebasing.
> 
>>     v2.17.0 - 17 Feb 2022
>> diff --git a/utilities/ovs-ctl.in b/utilities/ovs-ctl.in
>> index e6e07f476..8f900314b 100644
>> --- a/utilities/ovs-ctl.in
>> +++ b/utilities/ovs-ctl.in
>> @@ -103,8 +103,13 @@ set_system_ids () {
>>       action "Configuring Open vSwitch system IDs" "$@" $extra_ids
>>   }
>>   -check_force_cores () {
>> -    if test X"$FORCE_COREFILES" = Xyes; then
>> +check_core_config () {
>> +    if test X"$DUMP_HUGEPAGES" = Xyes; then
>> +        echo 0x3f > /proc/self/coredump_filter

Shouldn't this be 0x7f instead?
0x3f doesn't enable bit #6, which is responsible for dumping
shared huge pages.  Or am I missing something?
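
For reference, core(5) defines the coredump_filter bits listed below, which is
why 0x3f (bits 0-5) leaves out shared hugepages while 0x7f includes them.  A
minimal C sketch, with a made-up helper name, that sets the mask the same way
the "echo ... > /proc/self/coredump_filter" line in the patch does:

#include <stdio.h>

/* coredump_filter bits per core(5):
 *   bit 0: anonymous private mappings    bit 1: anonymous shared mappings
 *   bit 2: file-backed private mappings  bit 3: file-backed shared mappings
 *   bit 4: ELF headers                   bit 5: private huge pages
 *   bit 6: shared huge pages
 * 0x3f covers bits 0-5 only; 0x7f additionally sets bit 6. */
static int
set_coredump_filter(unsigned int mask)
{
    FILE *f = fopen("/proc/self/coredump_filter", "w");

    if (!f) {
        return -1;
    }
    fprintf(f, "0x%x", mask);
    return fclose(f);
}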

Best regards, Ilya Maximets.

>> +        if test X"$FORCE_COREFILES" = Xyes; then
>> +            ulimit -c unlimited
>> +        fi
>> +    elif test X"$FORCE_COREFILES" = Xyes; then
>>           ulimit -c 67108864
>>       fi
>>   }
>> @@ -116,7 +121,7 @@ del_transient_ports () {
>>   }
>>     do_start_ovsdb () {
>> -    check_force_cores
>> +    check_core_config
>>         if daemon_is_running ovsdb-server; then
>>           log_success_msg "ovsdb-server is already running"
>> @@ -193,7 +198,7 @@ add_managers () {
>>   }
>>     do_start_forwarding () {
>> -    check_force_cores
>> +    check_core_config
>>         insert_mod_if_required || return 1
>>   @@ -330,6 +335,7 @@ set_defaults () {
>>         DAEMON_CWD=/
>>       FORCE_COREFILES=yes
>> +    DUMP_HUGEPAGES=no
>>       MLOCKALL=yes
>>       SELF_CONFINEMENT=yes
>>       MONITOR=yes
>> @@ -419,6 +425,7 @@ Other important options for "start", "restart" and "force-reload-kmod":
>>   Less important options for "start", "restart" and "force-reload-kmod":
>>     --daemon-cwd=DIR               set working dir for OVS daemons (default: $DAEMON_CWD)
>>     --no-force-corefiles           do not force on core dumps for OVS daemons
>> +  --dump-hugepages               include hugepages in coredumps
>>     --no-mlockall                  do not lock all of ovs-vswitchd into memory
>>     --ovsdb-server-priority=NICE   set ovsdb-server's niceness (default: $OVSDB_SERVER_PRIORITY)
>>     --ovsdb-server-options=OPTIONS additional options for ovsdb-server (example: '-vconsole:dbg -vfile:dbg')
>>
> 
> Tested locally and verified that with the option hugepages appear in coredumps.
> Apart from the need to rebase the NEWS, the patch looks good to me.
> 
> Acked-by: Adrian Moreno <amorenoz@redhat.com>
> 
> -- 
> Adrián Moreno
Mike Pattrick Nov. 30, 2022, 7:11 p.m. UTC | #3
On Wed, Nov 30, 2022 at 7:27 AM Ilya Maximets <i.maximets@ovn.org> wrote:
>
> On 11/25/22 18:19, Adrian Moreno wrote:
> > Hi Mike,
> >
> > Sorry it took that long to review this patch.
> >
> > On 3/25/22 23:17, Mike Pattrick wrote:
> >> Add new option --dump-hugepages option in ovs-ctl to enable the addition
> >> of hugepages in the core dump filter.
> >>
> >> Signed-off-by: Mike Pattrick <mkp@redhat.com>
> >> ---
> >>   NEWS                 |  4 ++++
> >>   utilities/ovs-ctl.in | 15 +++++++++++----
> >>   2 files changed, 15 insertions(+), 4 deletions(-)
> >>
> >> diff --git a/NEWS b/NEWS
> >> index 8fa57836a..7af60dce3 100644
> >> --- a/NEWS
> >> +++ b/NEWS
> >> @@ -3,6 +3,10 @@ Post-v2.17.0
> >>      - OVSDB:
> >>        * 'relay' service model now supports transaction history, i.e. honors the
> >>          'last-txn-id' field in 'monitor_cond_since' requests from clients.
> >> +   - ovs-ctl:
> >> +     * New option '--dump-hugepages' to include hugepages in core dumps. This
> >> +       can assist with postmortem analysis involving DPDK, but may also produce
> >> +       significantly larger core dump files.
> >>
> >
> > I'm afraid this part needs rebasing.
> >
> >>     v2.17.0 - 17 Feb 2022
> >> diff --git a/utilities/ovs-ctl.in b/utilities/ovs-ctl.in
> >> index e6e07f476..8f900314b 100644
> >> --- a/utilities/ovs-ctl.in
> >> +++ b/utilities/ovs-ctl.in
> >> @@ -103,8 +103,13 @@ set_system_ids () {
> >>       action "Configuring Open vSwitch system IDs" "$@" $extra_ids
> >>   }
> >>   -check_force_cores () {
> >> -    if test X"$FORCE_COREFILES" = Xyes; then
> >> +check_core_config () {
> >> +    if test X"$DUMP_HUGEPAGES" = Xyes; then
> >> +        echo 0x3f > /proc/self/coredump_filter
>
> Shouldn't this be 0x7f instead?
> 0x3f doesn't enable bit #6, which is responsible for dumping
> shared huge pages.  Or am I missing something?

That's a good point; the hugepages may or may not be private. I'll send
in a new one.

Cheers,
M

>
> Best regards, Ilya Maximets.
>
> >> +        if test X"$FORCE_COREFILES" = Xyes; then
> >> +            ulimit -c unlimited
> >> +        fi
> >> +    elif test X"$FORCE_COREFILES" = Xyes; then
> >>           ulimit -c 67108864
> >>       fi
> >>   }
> >> @@ -116,7 +121,7 @@ del_transient_ports () {
> >>   }
> >>     do_start_ovsdb () {
> >> -    check_force_cores
> >> +    check_core_config
> >>         if daemon_is_running ovsdb-server; then
> >>           log_success_msg "ovsdb-server is already running"
> >> @@ -193,7 +198,7 @@ add_managers () {
> >>   }
> >>     do_start_forwarding () {
> >> -    check_force_cores
> >> +    check_core_config
> >>         insert_mod_if_required || return 1
> >>   @@ -330,6 +335,7 @@ set_defaults () {
> >>         DAEMON_CWD=/
> >>       FORCE_COREFILES=yes
> >> +    DUMP_HUGEPAGES=no
> >>       MLOCKALL=yes
> >>       SELF_CONFINEMENT=yes
> >>       MONITOR=yes
> >> @@ -419,6 +425,7 @@ Other important options for "start", "restart" and "force-reload-kmod":
> >>   Less important options for "start", "restart" and "force-reload-kmod":
> >>     --daemon-cwd=DIR               set working dir for OVS daemons (default: $DAEMON_CWD)
> >>     --no-force-corefiles           do not force on core dumps for OVS daemons
> >> +  --dump-hugepages               include hugepages in coredumps
> >>     --no-mlockall                  do not lock all of ovs-vswitchd into memory
> >>     --ovsdb-server-priority=NICE   set ovsdb-server's niceness (default: $OVSDB_SERVER_PRIORITY)
> >>     --ovsdb-server-options=OPTIONS additional options for ovsdb-server (example: '-vconsole:dbg -vfile:dbg')
> >>
> >
> > Tested locally and verified that with the option hugepages appear in coredumps.
> > Apart from the need to rebase the NEWS, the patch looks good to me.
> >
> > Acked-by: Adrian Moreno <amorenoz@redhat.com>
> >
> > --
> > Adrián Moreno
>
Ilya Maximets Nov. 30, 2022, 7:45 p.m. UTC | #4
On 11/30/22 20:11, Mike Pattrick wrote:
> On Wed, Nov 30, 2022 at 7:27 AM Ilya Maximets <i.maximets@ovn.org> wrote:
>>
>> On 11/25/22 18:19, Adrian Moreno wrote:
>>> Hi Mike,
>>>
>>> Sorry it took that long to review this patch.
>>>
>>> On 3/25/22 23:17, Mike Pattrick wrote:
>>>> Add new option --dump-hugepages option in ovs-ctl to enable the addition
>>>> of hugepages in the core dump filter.
>>>>
>>>> Signed-off-by: Mike Pattrick <mkp@redhat.com>
>>>> ---
>>>>   NEWS                 |  4 ++++
>>>>   utilities/ovs-ctl.in | 15 +++++++++++----
>>>>   2 files changed, 15 insertions(+), 4 deletions(-)
>>>>
>>>> diff --git a/NEWS b/NEWS
>>>> index 8fa57836a..7af60dce3 100644
>>>> --- a/NEWS
>>>> +++ b/NEWS
>>>> @@ -3,6 +3,10 @@ Post-v2.17.0
>>>>      - OVSDB:
>>>>        * 'relay' service model now supports transaction history, i.e. honors the
>>>>          'last-txn-id' field in 'monitor_cond_since' requests from clients.
>>>> +   - ovs-ctl:
>>>> +     * New option '--dump-hugepages' to include hugepages in core dumps. This
>>>> +       can assist with postmortem analysis involving DPDK, but may also produce
>>>> +       significantly larger core dump files.
>>>>
>>>
>>> I'm afraid this part needs rebasing.
>>>
>>>>     v2.17.0 - 17 Feb 2022
>>>> diff --git a/utilities/ovs-ctl.in b/utilities/ovs-ctl.in
>>>> index e6e07f476..8f900314b 100644
>>>> --- a/utilities/ovs-ctl.in
>>>> +++ b/utilities/ovs-ctl.in
>>>> @@ -103,8 +103,13 @@ set_system_ids () {
>>>>       action "Configuring Open vSwitch system IDs" "$@" $extra_ids
>>>>   }
>>>>   -check_force_cores () {
>>>> -    if test X"$FORCE_COREFILES" = Xyes; then
>>>> +check_core_config () {
>>>> +    if test X"$DUMP_HUGEPAGES" = Xyes; then
>>>> +        echo 0x3f > /proc/self/coredump_filter
>>
>> Shouldn't this be 0x7f instead?
>> 0x3f doesn't enable bit #6, which is responsible for dumping
>> shared huge pages.  Or am I missing something?
> 
> That's a good point, the hugepage may or may not be private. I'll send
> in a new one.

OK.  One thing to think about, though, is that we'll also grab
VM memory, I guess, if we have vhost-user ports.
So, the core dump size can become insanely huge.

The downside of not having those pages is the inability to inspect
virtqueues and related structures in the dump.

> 
> Cheers,
> M
> 
>>
>> Best regards, Ilya Maximets.
>>
>>>> +        if test X"$FORCE_COREFILES" = Xyes; then
>>>> +            ulimit -c unlimited
>>>> +        fi
>>>> +    elif test X"$FORCE_COREFILES" = Xyes; then
>>>>           ulimit -c 67108864
>>>>       fi
>>>>   }
>>>> @@ -116,7 +121,7 @@ del_transient_ports () {
>>>>   }
>>>>     do_start_ovsdb () {
>>>> -    check_force_cores
>>>> +    check_core_config
>>>>         if daemon_is_running ovsdb-server; then
>>>>           log_success_msg "ovsdb-server is already running"
>>>> @@ -193,7 +198,7 @@ add_managers () {
>>>>   }
>>>>     do_start_forwarding () {
>>>> -    check_force_cores
>>>> +    check_core_config
>>>>         insert_mod_if_required || return 1
>>>>   @@ -330,6 +335,7 @@ set_defaults () {
>>>>         DAEMON_CWD=/
>>>>       FORCE_COREFILES=yes
>>>> +    DUMP_HUGEPAGES=no
>>>>       MLOCKALL=yes
>>>>       SELF_CONFINEMENT=yes
>>>>       MONITOR=yes
>>>> @@ -419,6 +425,7 @@ Other important options for "start", "restart" and "force-reload-kmod":
>>>>   Less important options for "start", "restart" and "force-reload-kmod":
>>>>     --daemon-cwd=DIR               set working dir for OVS daemons (default: $DAEMON_CWD)
>>>>     --no-force-corefiles           do not force on core dumps for OVS daemons
>>>> +  --dump-hugepages               include hugepages in coredumps
>>>>     --no-mlockall                  do not lock all of ovs-vswitchd into memory
>>>>     --ovsdb-server-priority=NICE   set ovsdb-server's niceness (default: $OVSDB_SERVER_PRIORITY)
>>>>     --ovsdb-server-options=OPTIONS additional options for ovsdb-server (example: '-vconsole:dbg -vfile:dbg')
>>>>
>>>
>>> Tested locally and verified that with the option hugepages appear in coredumps.
>>> Apart from the need to rebase the NEWS, the patch looks good to me.
>>>
>>> Acked-by: Adrian Moreno <amorenoz@redhat.com>
>>>
>>> --
>>> Adrián Moreno
>>
>
David Marchand Nov. 30, 2022, 7:52 p.m. UTC | #5
On Wed, Nov 30, 2022 at 8:46 PM Ilya Maximets <i.maximets@ovn.org> wrote:
>
> On 11/30/22 20:11, Mike Pattrick wrote:
> > On Wed, Nov 30, 2022 at 7:27 AM Ilya Maximets <i.maximets@ovn.org> wrote:
> >>
> >> On 11/25/22 18:19, Adrian Moreno wrote:
> >>> Hi Mike,
> >>>
> >>> Sorry it took that long to review this patch.
> >>>
> >>> On 3/25/22 23:17, Mike Pattrick wrote:
> >>>> Add new option --dump-hugepages option in ovs-ctl to enable the addition
> >>>> of hugepages in the core dump filter.
> >>>>
> >>>> Signed-off-by: Mike Pattrick <mkp@redhat.com>
> >>>> ---
> >>>>   NEWS                 |  4 ++++
> >>>>   utilities/ovs-ctl.in | 15 +++++++++++----
> >>>>   2 files changed, 15 insertions(+), 4 deletions(-)
> >>>>
> >>>> diff --git a/NEWS b/NEWS
> >>>> index 8fa57836a..7af60dce3 100644
> >>>> --- a/NEWS
> >>>> +++ b/NEWS
> >>>> @@ -3,6 +3,10 @@ Post-v2.17.0
> >>>>      - OVSDB:
> >>>>        * 'relay' service model now supports transaction history, i.e. honors the
> >>>>          'last-txn-id' field in 'monitor_cond_since' requests from clients.
> >>>> +   - ovs-ctl:
> >>>> +     * New option '--dump-hugepages' to include hugepages in core dumps. This
> >>>> +       can assist with postmortem analysis involving DPDK, but may also produce
> >>>> +       significantly larger core dump files.
> >>>>
> >>>
> >>> I'm afraid this part needs rebasing.
> >>>
> >>>>     v2.17.0 - 17 Feb 2022
> >>>> diff --git a/utilities/ovs-ctl.in b/utilities/ovs-ctl.in
> >>>> index e6e07f476..8f900314b 100644
> >>>> --- a/utilities/ovs-ctl.in
> >>>> +++ b/utilities/ovs-ctl.in
> >>>> @@ -103,8 +103,13 @@ set_system_ids () {
> >>>>       action "Configuring Open vSwitch system IDs" "$@" $extra_ids
> >>>>   }
> >>>>   -check_force_cores () {
> >>>> -    if test X"$FORCE_COREFILES" = Xyes; then
> >>>> +check_core_config () {
> >>>> +    if test X"$DUMP_HUGEPAGES" = Xyes; then
> >>>> +        echo 0x3f > /proc/self/coredump_filter
> >>
> >> Shouldn't this be 0x7f instead?
> >> 0x3f doesn't enable bit #6, which is responsible for dumping
> >> shared huge pages.  Or am I missing something?
> >
> > That's a good point, the hugepage may or may not be private. I'll send
> > in a new one.
>
> OK.  One thing to think about though is that we'll grab
> VM memory, I guess, in case we have vhost-user ports.
> So, the core dump size can become insanely huge.
>
> The downside of not having them is inability to inspect
> virtqueues and stuff in the dump.

Did you consider madvise()?

       MADV_DONTDUMP (since Linux 3.4)
              Exclude from a core dump those pages in the range specified
              by addr and length.  This is useful in applications that
              have large areas of memory that are known not to be useful
              in a core dump.  The effect of MADV_DONTDUMP takes
              precedence over the bit mask that is set via the
              /proc/[pid]/coredump_filter file (see core(5)).

       MADV_DODUMP (since Linux 3.4)
              Undo the effect of an earlier MADV_DONTDUMP.
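
A minimal sketch of that call pattern, with made-up names (not taken from the
OVS or DPDK sources): exclude a whole guest mapping from core dumps by
default, then re-enable dumping only for a sub-range of interest such as a
virtqueue.

#include <stddef.h>
#include <sys/mman.h>

/* 'map'/'map_len' stand for a page-aligned guest memory mapping and
 * 'ring'/'ring_len' for a smaller page-aligned range inside it that is
 * worth keeping in the dump (addresses passed to madvise() must be
 * page-aligned). */
static int
limit_dump_to_ring(void *map, size_t map_len, void *ring, size_t ring_len)
{
    if (madvise(map, map_len, MADV_DONTDUMP) != 0) {
        return -1;
    }
    return madvise(ring, ring_len, MADV_DODUMP);
}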
Ilya Maximets Nov. 30, 2022, 8:30 p.m. UTC | #6
On 11/30/22 20:52, David Marchand wrote:
> On Wed, Nov 30, 2022 at 8:46 PM Ilya Maximets <i.maximets@ovn.org> wrote:
>>
>> On 11/30/22 20:11, Mike Pattrick wrote:
>>> On Wed, Nov 30, 2022 at 7:27 AM Ilya Maximets <i.maximets@ovn.org> wrote:
>>>>
>>>> On 11/25/22 18:19, Adrian Moreno wrote:
>>>>> Hi Mike,
>>>>>
>>>>> Sorry it took that long to review this patch.
>>>>>
>>>>> On 3/25/22 23:17, Mike Pattrick wrote:
>>>>>> Add new option --dump-hugepages option in ovs-ctl to enable the addition
>>>>>> of hugepages in the core dump filter.
>>>>>>
>>>>>> Signed-off-by: Mike Pattrick <mkp@redhat.com>
>>>>>> ---
>>>>>>   NEWS                 |  4 ++++
>>>>>>   utilities/ovs-ctl.in | 15 +++++++++++----
>>>>>>   2 files changed, 15 insertions(+), 4 deletions(-)
>>>>>>
>>>>>> diff --git a/NEWS b/NEWS
>>>>>> index 8fa57836a..7af60dce3 100644
>>>>>> --- a/NEWS
>>>>>> +++ b/NEWS
>>>>>> @@ -3,6 +3,10 @@ Post-v2.17.0
>>>>>>      - OVSDB:
>>>>>>        * 'relay' service model now supports transaction history, i.e. honors the
>>>>>>          'last-txn-id' field in 'monitor_cond_since' requests from clients.
>>>>>> +   - ovs-ctl:
>>>>>> +     * New option '--dump-hugepages' to include hugepages in core dumps. This
>>>>>> +       can assist with postmortem analysis involving DPDK, but may also produce
>>>>>> +       significantly larger core dump files.
>>>>>>
>>>>>
>>>>> I'm afraid this part needs rebasing.
>>>>>
>>>>>>     v2.17.0 - 17 Feb 2022
>>>>>> diff --git a/utilities/ovs-ctl.in b/utilities/ovs-ctl.in
>>>>>> index e6e07f476..8f900314b 100644
>>>>>> --- a/utilities/ovs-ctl.in
>>>>>> +++ b/utilities/ovs-ctl.in
>>>>>> @@ -103,8 +103,13 @@ set_system_ids () {
>>>>>>       action "Configuring Open vSwitch system IDs" "$@" $extra_ids
>>>>>>   }
>>>>>>   -check_force_cores () {
>>>>>> -    if test X"$FORCE_COREFILES" = Xyes; then
>>>>>> +check_core_config () {
>>>>>> +    if test X"$DUMP_HUGEPAGES" = Xyes; then
>>>>>> +        echo 0x3f > /proc/self/coredump_filter
>>>>
>>>> Shouldn't this be 0x7f instead?
>>>> 0x3f doesn't enable bit #6, which is responsible for dumping
>>>> shared huge pages.  Or am I missing something?
>>>
>>> That's a good point, the hugepage may or may not be private. I'll send
>>> in a new one.
>>
>> OK.  One thing to think about though is that we'll grab
>> VM memory, I guess, in case we have vhost-user ports.
>> So, the core dump size can become insanely huge.
>>
>> The downside of not having them is inability to inspect
>> virtqueues and stuff in the dump.
> 
> Did you consider madvise()?
> 
>        MADV_DONTDUMP (since Linux 3.4)
>               Exclude from a core dump those pages in the range
> specified by addr and length.  This is useful in applications that
> have large areas of memory that are known not to be useful in a core
> dump.  The effect of  MADV_DONT‐
>               DUMP takes precedence over the bit mask that is set via
> the /proc/[pid]/coredump_filter file (see core(5)).
> 
>        MADV_DODUMP (since Linux 3.4)
>               Undo the effect of an earlier MADV_DONTDUMP.

I don't think OVS actually knows the location of the particular VM memory
pages that we do not need.  And dumping virtqueues and related structures
is, probably, the point of this patch (?).

The vhost-user library might have a better idea of which particular parts
of the memory the guest may use for virtqueues and buffers, but I'm not
100% sure.

Best regards, Ilya Maximets.
David Marchand Dec. 2, 2022, 10:09 a.m. UTC | #7
On Wed, Nov 30, 2022 at 9:30 PM Ilya Maximets <i.maximets@ovn.org> wrote:
> >>>> Shouldn't this be 0x7f instead?
> >>>> 0x3f doesn't enable bit #6, which is responsible for dumping
> >>>> shared huge pages.  Or am I missing something?
> >>>
> >>> That's a good point, the hugepage may or may not be private. I'll send
> >>> in a new one.
> >>
> >> OK.  One thing to think about though is that we'll grab
> >> VM memory, I guess, in case we have vhost-user ports.
> >> So, the core dump size can become insanely huge.
> >>
> >> The downside of not having them is inability to inspect
> >> virtqueues and stuff in the dump.
> >
> > Did you consider madvise()?
> >
> >        MADV_DONTDUMP (since Linux 3.4)
> >               Exclude from a core dump those pages in the range
> > specified by addr and length.  This is useful in applications that
> > have large areas of memory that are known not to be useful in a core
> > dump.  The effect of  MADV_DONT‐
> >               DUMP takes precedence over the bit mask that is set via
> > the /proc/[pid]/coredump_filter file (see core(5)).
> >
> >        MADV_DODUMP (since Linux 3.4)
> >               Undo the effect of an earlier MADV_DONTDUMP.
>
> I don't think OVS actually knows location of particular VM memory
> pages that we do not need.  And dumping virtqueues and stuff is,
> probably, the point of this patch (?).
>
> vhost-user library might have a better idea on which particular parts
> of the memory guest may use for virtqueues and buffers, but I'm not
> 100% sure.

Yes, distinguishing hugepages of interest is a problem.

Since v20.05, the DPDK mem allocator takes care of excluding (unused)
hugepages from the dump.
So with this OVS patch, if we catch both private and shared hugepages, the
"interesting" DPDK hugepages will get dumped, which is useful for
post-mortem debugging.

Adding Maxime, who will have a better idea of what is possible for the
guest mapping part.
Maxime Coquelin Dec. 2, 2022, 10:36 a.m. UTC | #8
On 12/2/22 11:09, David Marchand wrote:
> On Wed, Nov 30, 2022 at 9:30 PM Ilya Maximets <i.maximets@ovn.org> wrote:
>>>>>> Shouldn't this be 0x7f instead?
>>>>>> 0x3f doesn't enable bit #6, which is responsible for dumping
>>>>>> shared huge pages.  Or am I missing something?
>>>>>
>>>>> That's a good point, the hugepage may or may not be private. I'll send
>>>>> in a new one.
>>>>
>>>> OK.  One thing to think about though is that we'll grab
>>>> VM memory, I guess, in case we have vhost-user ports.
>>>> So, the core dump size can become insanely huge.
>>>>
>>>> The downside of not having them is inability to inspect
>>>> virtqueues and stuff in the dump.
>>>
>>> Did you consider madvise()?
>>>
>>>         MADV_DONTDUMP (since Linux 3.4)
>>>                Exclude from a core dump those pages in the range
>>> specified by addr and length.  This is useful in applications that
>>> have large areas of memory that are known not to be useful in a core
>>> dump.  The effect of  MADV_DONT‐
>>>                DUMP takes precedence over the bit mask that is set via
>>> the /proc/[pid]/coredump_filter file (see core(5)).
>>>
>>>         MADV_DODUMP (since Linux 3.4)
>>>                Undo the effect of an earlier MADV_DONTDUMP.
>>
>> I don't think OVS actually knows location of particular VM memory
>> pages that we do not need.  And dumping virtqueues and stuff is,
>> probably, the point of this patch (?).
>>
>> vhost-user library might have a better idea on which particular parts
>> of the memory guest may use for virtqueues and buffers, but I'm not
>> 100% sure.
> 
> Yes, distinguishing hugepages of interest is a problem.
> 
> Since v20.05, DPDK mem allocator takes care of excluding (unused)
> hugepages from dump.
> So with this OVS patch, if we catch private and shared hugepages,
> "interesting" DPDK hugepages will get dumped, which is useful for
> debugging post mortem.
> 
> Adding Maxime, who will have a better idea of what is possible for the
> guest mapping part.
> 
> 

I wonder if we could do a MADV_DONTDUMP on all the guest memory at mmap
time; then there are two cases:
   a. vIOMMU = OFF. In this case we could do MADV_DODUMP on the virtqueues'
memory. Doing so, we would have the rings' memory, but not their buffers
(except if they happen to be located on the same hugepages).
   b. vIOMMU = ON. In this case we could do MADV_DODUMP on new IOTLB_UPDATE
entries and MADV_DONTDUMP on invalidated entries. Doing so, we would get
both the vrings and the buffers the backend is allowed to access.

I can prepare a PoC quickly if someone is willing to experiment.
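
A rough sketch of the mmap-time half of this idea, with made-up names (not the
actual vhost library code), assuming the region is mapped from a file
descriptor received over the vhost-user socket:

#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>

/* Map a guest memory region and exclude it from core dumps up front;
 * specific ranges (vrings, IOTLB entries) would later be re-enabled
 * with MADV_DODUMP as they become known. */
static void *
map_guest_region(int fd, size_t size, off_t offset)
{
    void *addr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED,
                      fd, offset);

    if (addr == MAP_FAILED) {
        return NULL;
    }
    /* Best effort: if this fails, the region simply stays dumpable. */
    (void) madvise(addr, size, MADV_DONTDUMP);
    return addr;
}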

Regards,
Maxime
Mike Pattrick Dec. 2, 2022, 2:59 p.m. UTC | #9
On Fri, Dec 2, 2022 at 5:37 AM Maxime Coquelin
<maxime.coquelin@redhat.com> wrote:
>
>
>
> On 12/2/22 11:09, David Marchand wrote:
> > On Wed, Nov 30, 2022 at 9:30 PM Ilya Maximets <i.maximets@ovn.org> wrote:
> >>>>>> Shouldn't this be 0x7f instead?
> >>>>>> 0x3f doesn't enable bit #6, which is responsible for dumping
> >>>>>> shared huge pages.  Or am I missing something?
> >>>>>
> >>>>> That's a good point, the hugepage may or may not be private. I'll send
> >>>>> in a new one.
> >>>>
> >>>> OK.  One thing to think about though is that we'll grab
> >>>> VM memory, I guess, in case we have vhost-user ports.
> >>>> So, the core dump size can become insanely huge.
> >>>>
> >>>> The downside of not having them is inability to inspect
> >>>> virtqueues and stuff in the dump.
> >>>
> >>> Did you consider madvise()?
> >>>
> >>>         MADV_DONTDUMP (since Linux 3.4)
> >>>                Exclude from a core dump those pages in the range
> >>> specified by addr and length.  This is useful in applications that
> >>> have large areas of memory that are known not to be useful in a core
> >>> dump.  The effect of  MADV_DONT‐
> >>>                DUMP takes precedence over the bit mask that is set via
> >>> the /proc/[pid]/coredump_filter file (see core(5)).
> >>>
> >>>         MADV_DODUMP (since Linux 3.4)
> >>>                Undo the effect of an earlier MADV_DONTDUMP.
> >>
> >> I don't think OVS actually knows location of particular VM memory
> >> pages that we do not need.  And dumping virtqueues and stuff is,
> >> probably, the point of this patch (?).
> >>
> >> vhost-user library might have a better idea on which particular parts
> >> of the memory guest may use for virtqueues and buffers, but I'm not
> >> 100% sure.
> >
> > Yes, distinguishing hugepages of interest is a problem.
> >
> > Since v20.05, DPDK mem allocator takes care of excluding (unused)
> > hugepages from dump.
> > So with this OVS patch, if we catch private and shared hugepages,
> > "interesting" DPDK hugepages will get dumped, which is useful for
> > debugging post mortem.
> >
> > Adding Maxime, who will have a better idea of what is possible for the
> > guest mapping part.
> >
> >
>
> I wonder if we could do a MADV_DONTDUMP on all the guest memory at mmap
> time, then there are two cases:
>    a. vIOMMU = OFF. In this case we could do MADV_DODUMP on virtqueues
> memory. Doing so, we would have the rings memory, but not their buffers
> (except if they are located on same hugepages).
>    b. vIOMMU = ON. In this case we could do MADV_DODUMP on IOTLB_UPDATE
> new entries and MADV_DONTDUMP on invalidated entries. Doing so we will
> get both vrings and their buffers the backend is allowed to access.
>
> I can prepare a PoC quickly if someone is willing to experiment.

A big motivation for this patch is that DPDK becomes a big black hole in
coredumps, preventing even netdev structures from being enumerated.

I threw together a small PoC for this yesterday, only marking the vhost
mmaps as DONTDUMP. The result was that a simple OVS configuration with
one vhost port attached to an 8 GB VM dropped the zstd-compressed coredump
(including shared huge pages) from 486.4M to 19.8M. The resulting
coredump was also significantly more debuggable: all internal OVS
data structures became available, as well as many DPDK ones.

I'll look at cleaning things up and submitting it to the DPDK mailing list.


Cheers,
M

>
> Regards,
> Maxime
>
>
Maxime Coquelin Dec. 2, 2022, 3:20 p.m. UTC | #10
On 12/2/22 15:59, Mike Pattrick wrote:
> On Fri, Dec 2, 2022 at 5:37 AM Maxime Coquelin
> <maxime.coquelin@redhat.com> wrote:
>>
>>
>>
>> On 12/2/22 11:09, David Marchand wrote:
>>> On Wed, Nov 30, 2022 at 9:30 PM Ilya Maximets <i.maximets@ovn.org> wrote:
>>>>>>>> Shouldn't this be 0x7f instead?
>>>>>>>> 0x3f doesn't enable bit #6, which is responsible for dumping
>>>>>>>> shared huge pages.  Or am I missing something?
>>>>>>>
>>>>>>> That's a good point, the hugepage may or may not be private. I'll send
>>>>>>> in a new one.
>>>>>>
>>>>>> OK.  One thing to think about though is that we'll grab
>>>>>> VM memory, I guess, in case we have vhost-user ports.
>>>>>> So, the core dump size can become insanely huge.
>>>>>>
>>>>>> The downside of not having them is inability to inspect
>>>>>> virtqueues and stuff in the dump.
>>>>>
>>>>> Did you consider madvise()?
>>>>>
>>>>>          MADV_DONTDUMP (since Linux 3.4)
>>>>>                 Exclude from a core dump those pages in the range
>>>>> specified by addr and length.  This is useful in applications that
>>>>> have large areas of memory that are known not to be useful in a core
>>>>> dump.  The effect of  MADV_DONT‐
>>>>>                 DUMP takes precedence over the bit mask that is set via
>>>>> the /proc/[pid]/coredump_filter file (see core(5)).
>>>>>
>>>>>          MADV_DODUMP (since Linux 3.4)
>>>>>                 Undo the effect of an earlier MADV_DONTDUMP.
>>>>
>>>> I don't think OVS actually knows location of particular VM memory
>>>> pages that we do not need.  And dumping virtqueues and stuff is,
>>>> probably, the point of this patch (?).
>>>>
>>>> vhost-user library might have a better idea on which particular parts
>>>> of the memory guest may use for virtqueues and buffers, but I'm not
>>>> 100% sure.
>>>
>>> Yes, distinguishing hugepages of interest is a problem.
>>>
>>> Since v20.05, DPDK mem allocator takes care of excluding (unused)
>>> hugepages from dump.
>>> So with this OVS patch, if we catch private and shared hugepages,
>>> "interesting" DPDK hugepages will get dumped, which is useful for
>>> debugging post mortem.
>>>
>>> Adding Maxime, who will have a better idea of what is possible for the
>>> guest mapping part.
>>>
>>>
>>
>> I wonder if we could do a MADV_DONTDUMP on all the guest memory at mmap
>> time, then there are two cases:
>>     a. vIOMMU = OFF. In this case we could do MADV_DODUMP on virtqueues
>> memory. Doing so, we would have the rings memory, but not their buffers
>> (except if they are located on same hugepages).
>>     b. vIOMMU = ON. In this case we could do MADV_DODUMP on IOTLB_UPDATE
>> new entries and MADV_DONTDUMP on invalidated entries. Doing so we will
>> get both vrings and their buffers the backend is allowed to access.
>>
>> I can prepare a PoC quickly if someone is willing to experiment.
> 
> A big motivation for this patch is DPDK becomes a big black hole in
> coredumps, preventing even netdev structures from being enumerated.
> 
> I threw together a small PoC for this yesterday, only marking vhost
> mmaps as DONTDUMP. The result was that a simple OVS configuration with
> one vhost attached to an 8gb VM dropped the zstd compressed coredump
> including shared huge pages from 486.4M to 19.8M. The resulting
> coredump was also significantly more debuggable, all internal OVS
> datastructures became available as well as many DPDK ones.
> 
> I'll look at cleaning things up and submitting to the DPDK mailing list.

Thanks for doing the PoC; that will definitely help debugging!

I have quickly drafted a patch doing the MADV_DODUMP on the areas of
interest for vhost, as described in my initial reply. Could you give it a
try? (disclaimer: not even build-tested)

Maxime

=========================================================================

diff --git a/lib/vhost/iotlb.c b/lib/vhost/iotlb.c
index 6a729e8804..555b4ebba0 100644
--- a/lib/vhost/iotlb.c
+++ b/lib/vhost/iotlb.c
@@ -149,6 +149,7 @@ vhost_user_iotlb_cache_remove_all(struct vhost_virtqueue *vq)
         rte_rwlock_write_lock(&vq->iotlb_lock);

         RTE_TAILQ_FOREACH_SAFE(node, &vq->iotlb_list, next, temp_node) {
+               madvise(node->uaddr, node->size, MADV_DODUMP);
                 TAILQ_REMOVE(&vq->iotlb_list, node, next);
                 vhost_user_iotlb_pool_put(vq, node);
         }
@@ -170,6 +171,7 @@ vhost_user_iotlb_cache_random_evict(struct vhost_virtqueue *vq)

         RTE_TAILQ_FOREACH_SAFE(node, &vq->iotlb_list, next, temp_node) {
                 if (!entry_idx) {
+                       madvise(node->uaddr, node->size, MADV_DODUMP);
                         TAILQ_REMOVE(&vq->iotlb_list, node, next);
                         vhost_user_iotlb_pool_put(vq, node);
                         vq->iotlb_cache_nr--;
@@ -222,12 +224,14 @@ vhost_user_iotlb_cache_insert(struct virtio_net *dev, struct vhost_virtqueue *vq
                         vhost_user_iotlb_pool_put(vq, new_node);
                         goto unlock;
                 } else if (node->iova > new_node->iova) {
+                       madvise(new_node->uaddr, new_node->size, MADV_DODUMP);
                         TAILQ_INSERT_BEFORE(node, new_node, next);
                         vq->iotlb_cache_nr++;
                         goto unlock;
                 }
         }

+       madvise(new_node->uaddr, new_node->size, MADV_DODUMP);
         TAILQ_INSERT_TAIL(&vq->iotlb_list, new_node, next);
         vq->iotlb_cache_nr++;

@@ -255,6 +259,7 @@ vhost_user_iotlb_cache_remove(struct vhost_virtqueue *vq,
                         break;

                 if (iova < node->iova + node->size) {
+                       madvise(node->uaddr, node->size, MADV_DODUMP);
                         TAILQ_REMOVE(&vq->iotlb_list, node, next);
                         vhost_user_iotlb_pool_put(vq, node);
                         vq->iotlb_cache_nr--;
diff --git a/lib/vhost/vhost_user.c b/lib/vhost/vhost_user.c
index 371d6304d6..208377ddc6 100644
--- a/lib/vhost/vhost_user.c
+++ b/lib/vhost/vhost_user.c
@@ -767,6 +767,8 @@ translate_ring_addresses(struct virtio_net **pdev, struct vhost_virtqueue **pvq)
                         return;
                 }

+               madvise(vq->desc_packed, len, MADV_DODUMP);
+
                 numa_realloc(&dev, &vq);
                 *pdev = dev;
                 *pvq = vq;
@@ -782,6 +784,8 @@ translate_ring_addresses(struct virtio_net **pdev, struct vhost_virtqueue **pvq)
                         return;
                 }

+               madvise(vq->driver_event, len, MADV_DODUMP);
+
                 len = sizeof(struct vring_packed_desc_event);
                 vq->device_event = (struct vring_packed_desc_event *)
                                         (uintptr_t)ring_addr_to_vva(dev,
@@ -793,6 +797,8 @@ translate_ring_addresses(struct virtio_net **pdev, struct vhost_virtqueue **pvq)
                         return;
                 }

+               madvise(vq->device_event, len, MADV_DODUMP);
+
                 vq->access_ok = true;
                 return;
         }
@@ -809,6 +815,8 @@ translate_ring_addresses(struct virtio_net **pdev, struct vhost_virtqueue **pvq)
                 return;
         }

+       madvise(vq->desc, len, MADV_DODUMP);
+
         numa_realloc(&dev, &vq);
         *pdev = dev;
         *pvq = vq;
@@ -824,6 +832,8 @@ translate_ring_addresses(struct virtio_net **pdev, struct vhost_virtqueue **pvq)
                 return;
         }

+       madvise(vq->avail, len, MADV_DODUMP);
+
         len = sizeof(struct vring_used) +
                 sizeof(struct vring_used_elem) * vq->size;
         if (dev->features & (1ULL << VIRTIO_RING_F_EVENT_IDX))
@@ -836,6 +846,8 @@ translate_ring_addresses(struct virtio_net **pdev, struct vhost_virtqueue **pvq)
                 return;
         }

+       madvise(vq->used, len, MADV_DODUMP);
+
         if (vq->last_used_idx != vq->used->idx) {
                 VHOST_LOG_CONFIG(dev->ifname, WARNING,
                         "last_used_idx (%u) and vq->used->idx (%u) 
mismatches;\n",
Mike Pattrick Dec. 2, 2022, 4:44 p.m. UTC | #11
On Fri, Dec 2, 2022 at 10:20 AM Maxime Coquelin
<maxime.coquelin@redhat.com> wrote:
>
>
>
> On 12/2/22 15:59, Mike Pattrick wrote:
> > On Fri, Dec 2, 2022 at 5:37 AM Maxime Coquelin
> > <maxime.coquelin@redhat.com> wrote:
> >>
> >>
> >>
> >> On 12/2/22 11:09, David Marchand wrote:
> >>> On Wed, Nov 30, 2022 at 9:30 PM Ilya Maximets <i.maximets@ovn.org> wrote:
> >>>>>>>> Shouldn't this be 0x7f instead?
> >>>>>>>> 0x3f doesn't enable bit #6, which is responsible for dumping
> >>>>>>>> shared huge pages.  Or am I missing something?
> >>>>>>>
> >>>>>>> That's a good point, the hugepage may or may not be private. I'll send
> >>>>>>> in a new one.
> >>>>>>
> >>>>>> OK.  One thing to think about though is that we'll grab
> >>>>>> VM memory, I guess, in case we have vhost-user ports.
> >>>>>> So, the core dump size can become insanely huge.
> >>>>>>
> >>>>>> The downside of not having them is inability to inspect
> >>>>>> virtqueues and stuff in the dump.
> >>>>>
> >>>>> Did you consider madvise()?
> >>>>>
> >>>>>          MADV_DONTDUMP (since Linux 3.4)
> >>>>>                 Exclude from a core dump those pages in the range
> >>>>> specified by addr and length.  This is useful in applications that
> >>>>> have large areas of memory that are known not to be useful in a core
> >>>>> dump.  The effect of  MADV_DONT‐
> >>>>>                 DUMP takes precedence over the bit mask that is set via
> >>>>> the /proc/[pid]/coredump_filter file (see core(5)).
> >>>>>
> >>>>>          MADV_DODUMP (since Linux 3.4)
> >>>>>                 Undo the effect of an earlier MADV_DONTDUMP.
> >>>>
> >>>> I don't think OVS actually knows location of particular VM memory
> >>>> pages that we do not need.  And dumping virtqueues and stuff is,
> >>>> probably, the point of this patch (?).
> >>>>
> >>>> vhost-user library might have a better idea on which particular parts
> >>>> of the memory guest may use for virtqueues and buffers, but I'm not
> >>>> 100% sure.
> >>>
> >>> Yes, distinguishing hugepages of interest is a problem.
> >>>
> >>> Since v20.05, DPDK mem allocator takes care of excluding (unused)
> >>> hugepages from dump.
> >>> So with this OVS patch, if we catch private and shared hugepages,
> >>> "interesting" DPDK hugepages will get dumped, which is useful for
> >>> debugging post mortem.
> >>>
> >>> Adding Maxime, who will have a better idea of what is possible for the
> >>> guest mapping part.
> >>>
> >>>
> >>
> >> I wonder if we could do a MADV_DONTDUMP on all the guest memory at mmap
> >> time, then there are two cases:
> >>     a. vIOMMU = OFF. In this case we could do MADV_DODUMP on virtqueues
> >> memory. Doing so, we would have the rings memory, but not their buffers
> >> (except if they are located on same hugepages).
> >>     b. vIOMMU = ON. In this case we could do MADV_DODUMP on IOTLB_UPDATE
> >> new entries and MADV_DONTDUMP on invalidated entries. Doing so we will
> >> get both vrings and their buffers the backend is allowed to access.
> >>
> >> I can prepare a PoC quickly if someone is willing to experiment.
> >
> > A big motivation for this patch is DPDK becomes a big black hole in
> > coredumps, preventing even netdev structures from being enumerated.
> >
> > I threw together a small PoC for this yesterday, only marking vhost
> > mmaps as DONTDUMP. The result was that a simple OVS configuration with
> > one vhost attached to an 8gb VM dropped the zstd compressed coredump
> > including shared huge pages from 486.4M to 19.8M. The resulting
> > coredump was also significantly more debuggable, all internal OVS
> > datastructures became available as well as many DPDK ones.
> >
> > I'll look at cleaning things up and submitting to the DPDK mailing list.
>
> Thanks for doing the PoC, that will definitely help debugging!
>
> I have quickly drafted a patch doing the MADV_DODUMP on the area of
> interest for Vhost as described in my initial reply, could you have a
> try with it? (disclaimer: not even build tested)

Dump with shared huge pages:
Thu 2022-12-01 13:07:33 EST 213904 0 0 SIGSEGV present /usr/sbin/ovs-vswitchd 486.4M
uncompressed size: 15,047,430,144

Dump with vhost mmaps set to DONTDUMP:
Thu 2022-12-01 13:23:02 EST 217253 0 0 SIGSEGV present /usr/sbin/ovs-vswitchd 19.8M
uncompressed size: 4,310,011,904

Dump with Maxime's patch:
Fri 2022-12-02 11:34:16 EST 232661 0 0 SIGSEGV present /usr/sbin/ovs-vswitchd 19.7M
uncompressed size: 4,310,007,808

Not sure where the 4 kB went from yesterday to today.

-M

>
> Maxime
>
> =========================================================================
>
> diff --git a/lib/vhost/iotlb.c b/lib/vhost/iotlb.c
> index 6a729e8804..555b4ebba0 100644
> --- a/lib/vhost/iotlb.c
> +++ b/lib/vhost/iotlb.c
> @@ -149,6 +149,7 @@ vhost_user_iotlb_cache_remove_all(struct
> vhost_virtqueue *vq)
>          rte_rwlock_write_lock(&vq->iotlb_lock);
>
>          RTE_TAILQ_FOREACH_SAFE(node, &vq->iotlb_list, next, temp_node) {
> +               madvise(node->uaddr, node->size, MADV_DODUMP);
>                  TAILQ_REMOVE(&vq->iotlb_list, node, next);
>                  vhost_user_iotlb_pool_put(vq, node);
>          }
> @@ -170,6 +171,7 @@ vhost_user_iotlb_cache_random_evict(struct
> vhost_virtqueue *vq)
>
>          RTE_TAILQ_FOREACH_SAFE(node, &vq->iotlb_list, next, temp_node) {
>                  if (!entry_idx) {
> +                       madvise(node->uaddr, node->size, MADV_DODUMP);
>                          TAILQ_REMOVE(&vq->iotlb_list, node, next);
>                          vhost_user_iotlb_pool_put(vq, node);
>                          vq->iotlb_cache_nr--;
> @@ -222,12 +224,14 @@ vhost_user_iotlb_cache_insert(struct virtio_net
> *dev, struct vhost_virtqueue *vq
>                          vhost_user_iotlb_pool_put(vq, new_node);
>                          goto unlock;
>                  } else if (node->iova > new_node->iova) {
> +                       madvise(new_node->uaddr, new_node->size,
> MADV_DODUMP);
>                          TAILQ_INSERT_BEFORE(node, new_node, next);
>                          vq->iotlb_cache_nr++;
>                          goto unlock;
>                  }
>          }
>
> +       madvise(new_node->uaddr, new_node->size, MADV_DODUMP);
>          TAILQ_INSERT_TAIL(&vq->iotlb_list, new_node, next);
>          vq->iotlb_cache_nr++;
>
> @@ -255,6 +259,7 @@ vhost_user_iotlb_cache_remove(struct vhost_virtqueue
> *vq,
>                          break;
>
>                  if (iova < node->iova + node->size) {
> +                       madvise(node->uaddr, node->size, MADV_DODUMP);
>                          TAILQ_REMOVE(&vq->iotlb_list, node, next);
>                          vhost_user_iotlb_pool_put(vq, node);
>                          vq->iotlb_cache_nr--;
> diff --git a/lib/vhost/vhost_user.c b/lib/vhost/vhost_user.c
> index 371d6304d6..208377ddc6 100644
> --- a/lib/vhost/vhost_user.c
> +++ b/lib/vhost/vhost_user.c
> @@ -767,6 +767,8 @@ translate_ring_addresses(struct virtio_net **pdev,
> struct vhost_virtqueue **pvq)
>                          return;
>                  }
>
> +               madvise(vq->desc_packed, len, MADV_DODUMP);
> +
>                  numa_realloc(&dev, &vq);
>                  *pdev = dev;
>                  *pvq = vq;
> @@ -782,6 +784,8 @@ translate_ring_addresses(struct virtio_net **pdev,
> struct vhost_virtqueue **pvq)
>                          return;
>                  }
>
> +               madvise(vq->driver_event, len, MADV_DODUMP);
> +
>                  len = sizeof(struct vring_packed_desc_event);
>                  vq->device_event = (struct vring_packed_desc_event *)
>                                          (uintptr_t)ring_addr_to_vva(dev,
> @@ -793,6 +797,8 @@ translate_ring_addresses(struct virtio_net **pdev,
> struct vhost_virtqueue **pvq)
>                          return;
>                  }
>
> +               madvise(vq->device_event, len, MADV_DODUMP);
> +
>                  vq->access_ok = true;
>                  return;
>          }
> @@ -809,6 +815,8 @@ translate_ring_addresses(struct virtio_net **pdev,
> struct vhost_virtqueue **pvq)
>                  return;
>          }
>
> +       madvise(vq->desc, len, MADV_DODUMP);
> +
>          numa_realloc(&dev, &vq);
>          *pdev = dev;
>          *pvq = vq;
> @@ -824,6 +832,8 @@ translate_ring_addresses(struct virtio_net **pdev,
> struct vhost_virtqueue **pvq)
>                  return;
>          }
>
> +       madvise(vq->avail, len, MADV_DODUMP);
> +
>          len = sizeof(struct vring_used) +
>                  sizeof(struct vring_used_elem) * vq->size;
>          if (dev->features & (1ULL << VIRTIO_RING_F_EVENT_IDX))
> @@ -836,6 +846,8 @@ translate_ring_addresses(struct virtio_net **pdev,
> struct vhost_virtqueue **pvq)
>                  return;
>          }
>
> +       madvise(vq->used, len, MADV_DODUMP);
> +
>          if (vq->last_used_idx != vq->used->idx) {
>                  VHOST_LOG_CONFIG(dev->ifname, WARNING,
>                          "last_used_idx (%u) and vq->used->idx (%u)
> mismatches;\n",
>
Ilya Maximets Dec. 2, 2022, 4:59 p.m. UTC | #12
On 12/2/22 11:36, Maxime Coquelin wrote:
> 
> 
> On 12/2/22 11:09, David Marchand wrote:
>> On Wed, Nov 30, 2022 at 9:30 PM Ilya Maximets <i.maximets@ovn.org> wrote:
>>>>>>> Shouldn't this be 0x7f instead?
>>>>>>> 0x3f doesn't enable bit #6, which is responsible for dumping
>>>>>>> shared huge pages.  Or am I missing something?
>>>>>>
>>>>>> That's a good point, the hugepage may or may not be private. I'll send
>>>>>> in a new one.
>>>>>
>>>>> OK.  One thing to think about though is that we'll grab
>>>>> VM memory, I guess, in case we have vhost-user ports.
>>>>> So, the core dump size can become insanely huge.
>>>>>
>>>>> The downside of not having them is inability to inspect
>>>>> virtqueues and stuff in the dump.
>>>>
>>>> Did you consider madvise()?
>>>>
>>>>         MADV_DONTDUMP (since Linux 3.4)
>>>>                Exclude from a core dump those pages in the range
>>>> specified by addr and length.  This is useful in applications that
>>>> have large areas of memory that are known not to be useful in a core
>>>> dump.  The effect of  MADV_DONT‐
>>>>                DUMP takes precedence over the bit mask that is set via
>>>> the /proc/[pid]/coredump_filter file (see core(5)).
>>>>
>>>>         MADV_DODUMP (since Linux 3.4)
>>>>                Undo the effect of an earlier MADV_DONTDUMP.
>>>
>>> I don't think OVS actually knows location of particular VM memory
>>> pages that we do not need.  And dumping virtqueues and stuff is,
>>> probably, the point of this patch (?).
>>>
>>> vhost-user library might have a better idea on which particular parts
>>> of the memory guest may use for virtqueues and buffers, but I'm not
>>> 100% sure.
>>
>> Yes, distinguishing hugepages of interest is a problem.
>>
>> Since v20.05, DPDK mem allocator takes care of excluding (unused)
>> hugepages from dump.
>> So with this OVS patch, if we catch private and shared hugepages,
>> "interesting" DPDK hugepages will get dumped, which is useful for
>> debugging post mortem.
>>
>> Adding Maxime, who will have a better idea of what is possible for the
>> guest mapping part.
>>
>>
> 
> I wonder if we could do a MADV_DONTDUMP on all the guest memory at mmap
> time, then there are two cases:
>   a. vIOMMU = OFF. In this case we could do MADV_DODUMP on virtqueues
> memory. Doing so, we would have the rings memory, but not their buffers
> (except if they are located on same hugepages).
>   b. vIOMMU = ON. In this case we could do MADV_DODUMP on IOTLB_UPDATE
> new entries and MADV_DONTDUMP on invalidated entries. Doing so we will
> get both vrings and their buffers the backend is allowed to access.

I guess that, while DONTDUMP calls are mainly harmless, the explicit DODUMP
will override whatever the user had in their global configuration.  Meaning
that every DPDK application with vhost ports will start dumping some of the
guest pages, with no actual ability to turn that off.

Can the behavior be made configurable?

> 
> I can prepare a PoC quickly if someone is willing to experiment.
> 
> Regards,
> Maxime
> 
>
Mike Pattrick Dec. 2, 2022, 5:59 p.m. UTC | #13
On Fri, Dec 2, 2022 at 11:59 AM Ilya Maximets <i.maximets@ovn.org> wrote:
>
> On 12/2/22 11:36, Maxime Coquelin wrote:
> >
> >
> > On 12/2/22 11:09, David Marchand wrote:
> >> On Wed, Nov 30, 2022 at 9:30 PM Ilya Maximets <i.maximets@ovn.org> wrote:
> >>>>>>> Shouldn't this be 0x7f instead?
> >>>>>>> 0x3f doesn't enable bit #6, which is responsible for dumping
> >>>>>>> shared huge pages.  Or am I missing something?
> >>>>>>
> >>>>>> That's a good point, the hugepage may or may not be private. I'll send
> >>>>>> in a new one.
> >>>>>
> >>>>> OK.  One thing to think about though is that we'll grab
> >>>>> VM memory, I guess, in case we have vhost-user ports.
> >>>>> So, the core dump size can become insanely huge.
> >>>>>
> >>>>> The downside of not having them is inability to inspect
> >>>>> virtqueues and stuff in the dump.
> >>>>
> >>>> Did you consider madvise()?
> >>>>
> >>>>         MADV_DONTDUMP (since Linux 3.4)
> >>>>                Exclude from a core dump those pages in the range
> >>>> specified by addr and length.  This is useful in applications that
> >>>> have large areas of memory that are known not to be useful in a core
> >>>> dump.  The effect of  MADV_DONT‐
> >>>>                DUMP takes precedence over the bit mask that is set via
> >>>> the /proc/[pid]/coredump_filter file (see core(5)).
> >>>>
> >>>>         MADV_DODUMP (since Linux 3.4)
> >>>>                Undo the effect of an earlier MADV_DONTDUMP.
> >>>
> >>> I don't think OVS actually knows location of particular VM memory
> >>> pages that we do not need.  And dumping virtqueues and stuff is,
> >>> probably, the point of this patch (?).
> >>>
> >>> vhost-user library might have a better idea on which particular parts
> >>> of the memory guest may use for virtqueues and buffers, but I'm not
> >>> 100% sure.
> >>
> >> Yes, distinguishing hugepages of interest is a problem.
> >>
> >> Since v20.05, DPDK mem allocator takes care of excluding (unused)
> >> hugepages from dump.
> >> So with this OVS patch, if we catch private and shared hugepages,
> >> "interesting" DPDK hugepages will get dumped, which is useful for
> >> debugging post mortem.
> >>
> >> Adding Maxime, who will have a better idea of what is possible for the
> >> guest mapping part.
> >>
> >>
> >
> > I wonder if we could do a MADV_DONTDUMP on all the guest memory at mmap
> > time, then there are two cases:
> >   a. vIOMMU = OFF. In this case we could do MADV_DODUMP on virtqueues
> > memory. Doing so, we would have the rings memory, but not their buffers
> > (except if they are located on same hugepages).
> >   b. vIOMMU = ON. In this case we could do MADV_DODUMP on IOTLB_UPDATE
> > new entries and MADV_DONTDUMP on invalidated entries. Doing so we will
> > get both vrings and their buffers the backend is allowed to access.
>
> I guess, while DONTDUMP calls are mainly harmless, the explicit DODUMP
> will override whatever user had in their global configuration.  Meaning
> every DPDK application with vhost ports will start dumping some of the
> guest pages with no actual ability to turn that off.

I initially thought it would work that way, but the DODUMP flag just
disables the DONTDUMP flag.

https://github.com/torvalds/linux/blob/master/mm/madvise.c#L1055
https://github.com/torvalds/linux/blob/master/fs/coredump.c#L1033

Cheers,
M

>
> Can the behavior be configurable?
>
> >
> > I can prepare a PoC quickly if someone is willing to experiment.
> >
> > Regards,
> > Maxime
> >
> >
>
Ilya Maximets Dec. 2, 2022, 6:40 p.m. UTC | #14
On 12/2/22 18:59, Mike Pattrick wrote:
> On Fri, Dec 2, 2022 at 11:59 AM Ilya Maximets <i.maximets@ovn.org> wrote:
>>
>> On 12/2/22 11:36, Maxime Coquelin wrote:
>>>
>>>
>>> On 12/2/22 11:09, David Marchand wrote:
>>>> On Wed, Nov 30, 2022 at 9:30 PM Ilya Maximets <i.maximets@ovn.org> wrote:
>>>>>>>>> Shouldn't this be 0x7f instead?
>>>>>>>>> 0x3f doesn't enable bit #6, which is responsible for dumping
>>>>>>>>> shared huge pages.  Or am I missing something?
>>>>>>>>
>>>>>>>> That's a good point, the hugepage may or may not be private. I'll send
>>>>>>>> in a new one.
>>>>>>>
>>>>>>> OK.  One thing to think about though is that we'll grab
>>>>>>> VM memory, I guess, in case we have vhost-user ports.
>>>>>>> So, the core dump size can become insanely huge.
>>>>>>>
>>>>>>> The downside of not having them is inability to inspect
>>>>>>> virtqueues and stuff in the dump.
>>>>>>
>>>>>> Did you consider madvise()?
>>>>>>
>>>>>>         MADV_DONTDUMP (since Linux 3.4)
>>>>>>                Exclude from a core dump those pages in the range
>>>>>> specified by addr and length.  This is useful in applications that
>>>>>> have large areas of memory that are known not to be useful in a core
>>>>>> dump.  The effect of  MADV_DONT‐
>>>>>>                DUMP takes precedence over the bit mask that is set via
>>>>>> the /proc/[pid]/coredump_filter file (see core(5)).
>>>>>>
>>>>>>         MADV_DODUMP (since Linux 3.4)
>>>>>>                Undo the effect of an earlier MADV_DONTDUMP.
>>>>>
>>>>> I don't think OVS actually knows location of particular VM memory
>>>>> pages that we do not need.  And dumping virtqueues and stuff is,
>>>>> probably, the point of this patch (?).
>>>>>
>>>>> vhost-user library might have a better idea on which particular parts
>>>>> of the memory guest may use for virtqueues and buffers, but I'm not
>>>>> 100% sure.
>>>>
>>>> Yes, distinguishing hugepages of interest is a problem.
>>>>
>>>> Since v20.05, DPDK mem allocator takes care of excluding (unused)
>>>> hugepages from dump.
>>>> So with this OVS patch, if we catch private and shared hugepages,
>>>> "interesting" DPDK hugepages will get dumped, which is useful for
>>>> debugging post mortem.
>>>>
>>>> Adding Maxime, who will have a better idea of what is possible for the
>>>> guest mapping part.
>>>>
>>>>
>>>
>>> I wonder if we could do a MADV_DONTDUMP on all the guest memory at mmap
>>> time, then there are two cases:
>>>   a. vIOMMU = OFF. In this case we could do MADV_DODUMP on virtqueues
>>> memory. Doing so, we would have the rings memory, but not their buffers
>>> (except if they are located on same hugepages).
>>>   b. vIOMMU = ON. In this case we could do MADV_DODUMP on IOTLB_UPDATE
>>> new entries and MADV_DONTDUMP on invalidated entries. Doing so we will
>>> get both vrings and their buffers the backend is allowed to access.
>>
>> I guess, while DONTDUMP calls are mainly harmless, the explicit DODUMP
>> will override whatever user had in their global configuration.  Meaning
>> every DPDK application with vhost ports will start dumping some of the
>> guest pages with no actual ability to turn that off.
> 
> I initially thought it would work that way, but the DODUMP flag just
> disables the DONTDUMP flag.
> 
> https://github.com/torvalds/linux/blob/master/mm/madvise.c#L1055
> https://github.com/torvalds/linux/blob/master/fs/coredump.c#L1033

Hmm, interesting.  Makes sense.

Thanks for the pointers!

So, it should still be 7f regardless in the coredump filter for OVS, right?
Do you plan to update the current patch or do you think we should omit
shared pages until support for MADV_DO/DONTDUMP is added to vhost library?

Note that this will likely not be available in 22.11 as it's not a bug fix.
So, 23.11 at the earliest.

Basically 2 options:

1. 0x3f and not having shared pages.  Flip to 0x7f with DPDK 23.11 next year.
   Pros: Smaller files
   Cons: Missing some of the virtqueue memory until [potentially] DPDK 23.11.

2. 0x7f today.
   Pros: All the memory is available.
   Cons: [Significantly] larger files until [potentially] DPDK 23.11.

What do you think?  David, Maxime?
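
For reference, the only difference between the two values is bit 6 of the
coredump filter.  A minimal shell sketch (the pidof lookup is illustrative,
not something the patch itself does) for checking what a running daemon ends
up with:

    # Bits of /proc/<pid>/coredump_filter, per core(5):
    #   0 anon private   1 anon shared   2 file-backed private
    #   3 file-backed shared   4 ELF headers
    #   5 hugetlb private   6 hugetlb shared
    # 0x3f sets bits 0-5; 0x7f additionally sets bit 6 (shared hugepages).
    pid=$(pidof ovs-vswitchd)
    cat /proc/$pid/coredump_filter    # prints e.g. 0000003f or 0000007f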

> 
> Cheers,
> M
> 
>>
>> Can the behavior be configurable?
>>
>>>
>>> I can prepare a PoC quickly if someone is willing to experiment.
>>>
>>> Regards,
>>> Maxime
>>>
>>>
>>
>
Mike Pattrick Dec. 2, 2022, 8:14 p.m. UTC | #15
On Fri, Dec 2, 2022 at 1:40 PM Ilya Maximets <i.maximets@ovn.org> wrote:
>
> On 12/2/22 18:59, Mike Pattrick wrote:
> > On Fri, Dec 2, 2022 at 11:59 AM Ilya Maximets <i.maximets@ovn.org> wrote:
> >>
> >> On 12/2/22 11:36, Maxime Coquelin wrote:
> >>>
> >>>
> >>> On 12/2/22 11:09, David Marchand wrote:
> >>>> On Wed, Nov 30, 2022 at 9:30 PM Ilya Maximets <i.maximets@ovn.org> wrote:
> >>>>>>>>> Shouldn't this be 0x7f instead?
> >>>>>>>>> 0x3f doesn't enable bit #6, which is responsible for dumping
> >>>>>>>>> shared huge pages.  Or am I missing something?
> >>>>>>>>
> >>>>>>>> That's a good point, the hugepage may or may not be private. I'll send
> >>>>>>>> in a new one.
> >>>>>>>
> >>>>>>> OK.  One thing to think about though is that we'll grab
> >>>>>>> VM memory, I guess, in case we have vhost-user ports.
> >>>>>>> So, the core dump size can become insanely huge.
> >>>>>>>
> >>>>>>> The downside of not having them is inability to inspect
> >>>>>>> virtqueues and stuff in the dump.
> >>>>>>
> >>>>>> Did you consider madvise()?
> >>>>>>
> >>>>>>         MADV_DONTDUMP (since Linux 3.4)
> >>>>>>                Exclude from a core dump those pages in the range
> >>>>>> specified by addr and length.  This is useful in applications that
> >>>>>> have large areas of memory that are known not to be useful in a core
> >>>>>> dump.  The effect of  MADV_DONT‐
> >>>>>>                DUMP takes precedence over the bit mask that is set via
> >>>>>> the /proc/[pid]/coredump_filter file (see core(5)).
> >>>>>>
> >>>>>>         MADV_DODUMP (since Linux 3.4)
> >>>>>>                Undo the effect of an earlier MADV_DONTDUMP.
> >>>>>
> >>>>> I don't think OVS actually knows location of particular VM memory
> >>>>> pages that we do not need.  And dumping virtqueues and stuff is,
> >>>>> probably, the point of this patch (?).
> >>>>>
> >>>>> vhost-user library might have a better idea on which particular parts
> >>>>> of the memory guest may use for virtqueues and buffers, but I'm not
> >>>>> 100% sure.
> >>>>
> >>>> Yes, distinguishing hugepages of interest is a problem.
> >>>>
> >>>> Since v20.05, DPDK mem allocator takes care of excluding (unused)
> >>>> hugepages from dump.
> >>>> So with this OVS patch, if we catch private and shared hugepages,
> >>>> "interesting" DPDK hugepages will get dumped, which is useful for
> >>>> debugging post mortem.
> >>>>
> >>>> Adding Maxime, who will have a better idea of what is possible for the
> >>>> guest mapping part.
> >>>>
> >>>>
> >>>
> >>> I wonder if we could do a MADV_DONTDUMP on all the guest memory at mmap
> >>> time, then there are two cases:
> >>>   a. vIOMMU = OFF. In this case we could do MADV_DODUMP on virtqueues
> >>> memory. Doing so, we would have the rings memory, but not their buffers
> >>> (except if they are located on same hugepages).
> >>>   b. vIOMMU = ON. In this case we could do MADV_DODUMP on IOTLB_UPDATE
> >>> new entries and MADV_DONTDUMP on invalidated entries. Doing so we will
> >>> get both vrings and their buffers the backend is allowed to access.
> >>
> >> I guess, while DONTDUMP calls are mainly harmless, the explicit DODUMP
> >> will override whatever user had in their global configuration.  Meaning
> >> every DPDK application with vhost ports will start dumping some of the
> >> guest pages with no actual ability to turn that off.
> >
> > I initially thought it would work that way, but the DODUMP flag just
> > disables the DONTDUMP flag.
> >
> > https://github.com/torvalds/linux/blob/master/mm/madvise.c#L1055
> > https://github.com/torvalds/linux/blob/master/fs/coredump.c#L1033
>
> Hmm, interesting.  Makes sense.
>
> Thanks for the pointers!
>
> So, it should still be 7f regardless in the coredump filter for OVS, right?
> Do you plan to update the current patch or do you think we should omit
> shared pages until support for MADV_DO/DONTDUMP is added to vhost library?
>
> Note that this will likely not be available in 22.11 as it's not a bug fix.
> So, 23.11 at the earliest.
>
> Basically 2 options:
>
> 1. 0x3f and not having shared pages.  Flip to 0x7f with DPDK 23.11 next year.
>    Pros: Smaller files
>    Cons: Missing some of the virtqueue memory until [potentially] DPDK 23.11.
>
> 2. 0x7f today.
>    Pros: All the memory is available.
>    Cons: [Significantly] larger files until [potentially] DPDK 23.11.
>
> What do you think?  David, Maxime?

I'd prefer 7f today. It's disabled by default, has zero impact on end
users, makes setting up debugging environments more convenient, and on
distributions with systemd the larger coredumps are managed somewhat
automatically. The news item already warns about large coredumps.
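
As a sketch of what that management looks like on a systemd host (the drop-in
path and size values below are only examples, not something this patch
configures), systemd-coredump limits can be raised so the larger dumps are
kept instead of truncated:

    # /etc/systemd/coredump.conf.d/ovs.conf  -- example values only
    [Coredump]
    ProcessSizeMax=32G
    ExternalSizeMax=32G

    # captured dumps can then be listed and extracted with, e.g.:
    #   coredumpctl list ovs-vswitchd
    #   coredumpctl -o vswitchd.core dump ovs-vswitchd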

WDYT?

-M

> >
> > Cheers,
> > M
> >
> >>
> >> Can the behavior be configurable?
> >>
> >>>
> >>> I can prepare a PoC quickly if someone is willing to experiment.
> >>>
> >>> Regards,
> >>> Maxime
> >>>
> >>>
> >>
> >
>
David Marchand Dec. 5, 2022, 12:46 p.m. UTC | #16
On Fri, Dec 2, 2022 at 7:00 PM Mike Pattrick <mkp@redhat.com> wrote:
> > >>>> Did you consider madvise()?
> > >>>>
> > >>>>         MADV_DONTDUMP (since Linux 3.4)
> > >>>>                Exclude from a core dump those pages in the range
> > >>>> specified by addr and length.  This is useful in applications that
> > >>>> have large areas of memory that are known not to be useful in a core
> > >>>> dump.  The effect of  MADV_DONT‐
> > >>>>                DUMP takes precedence over the bit mask that is set via
> > >>>> the /proc/[pid]/coredump_filter file (see core(5)).
> > >>>>
> > >>>>         MADV_DODUMP (since Linux 3.4)
> > >>>>                Undo the effect of an earlier MADV_DONTDUMP.
> > >>>
> > I guess, while DONTDUMP calls are mainly harmless, the explicit DODUMP
> > will override whatever user had in their global configuration.  Meaning
> > every DPDK application with vhost ports will start dumping some of the
> > guest pages with no actual ability to turn that off.
>
> I initially thought it would work that way, but the DODUMP flag just
> disables the DONTDUMP flag.
>
> https://github.com/torvalds/linux/blob/master/mm/madvise.c#L1055
> https://github.com/torvalds/linux/blob/master/fs/coredump.c#L1033
>

Glad to read that the manual tells the same story as the kernel code :-).
Ilya Maximets Dec. 5, 2022, 12:49 p.m. UTC | #17
On 12/2/22 21:14, Mike Pattrick wrote:
> On Fri, Dec 2, 2022 at 1:40 PM Ilya Maximets <i.maximets@ovn.org> wrote:
>>
>> On 12/2/22 18:59, Mike Pattrick wrote:
>>> On Fri, Dec 2, 2022 at 11:59 AM Ilya Maximets <i.maximets@ovn.org> wrote:
>>>>
>>>> On 12/2/22 11:36, Maxime Coquelin wrote:
>>>>>
>>>>>
>>>>> On 12/2/22 11:09, David Marchand wrote:
>>>>>> On Wed, Nov 30, 2022 at 9:30 PM Ilya Maximets <i.maximets@ovn.org> wrote:
>>>>>>>>>>> Shouldn't this be 0x7f instead?
>>>>>>>>>>> 0x3f doesn't enable bit #6, which is responsible for dumping
>>>>>>>>>>> shared huge pages.  Or am I missing something?
>>>>>>>>>>
>>>>>>>>>> That's a good point, the hugepage may or may not be private. I'll send
>>>>>>>>>> in a new one.
>>>>>>>>>
>>>>>>>>> OK.  One thing to think about though is that we'll grab
>>>>>>>>> VM memory, I guess, in case we have vhost-user ports.
>>>>>>>>> So, the core dump size can become insanely huge.
>>>>>>>>>
>>>>>>>>> The downside of not having them is inability to inspect
>>>>>>>>> virtqueues and stuff in the dump.
>>>>>>>>
>>>>>>>> Did you consider madvise()?
>>>>>>>>
>>>>>>>>         MADV_DONTDUMP (since Linux 3.4)
>>>>>>>>                Exclude from a core dump those pages in the range
>>>>>>>> specified by addr and length.  This is useful in applications that
>>>>>>>> have large areas of memory that are known not to be useful in a core
>>>>>>>> dump.  The effect of  MADV_DONT‐
>>>>>>>>                DUMP takes precedence over the bit mask that is set via
>>>>>>>> the /proc/[pid]/coredump_filter file (see core(5)).
>>>>>>>>
>>>>>>>>         MADV_DODUMP (since Linux 3.4)
>>>>>>>>                Undo the effect of an earlier MADV_DONTDUMP.
>>>>>>>
>>>>>>> I don't think OVS actually knows location of particular VM memory
>>>>>>> pages that we do not need.  And dumping virtqueues and stuff is,
>>>>>>> probably, the point of this patch (?).
>>>>>>>
>>>>>>> vhost-user library might have a better idea on which particular parts
>>>>>>> of the memory guest may use for virtqueues and buffers, but I'm not
>>>>>>> 100% sure.
>>>>>>
>>>>>> Yes, distinguishing hugepages of interest is a problem.
>>>>>>
>>>>>> Since v20.05, DPDK mem allocator takes care of excluding (unused)
>>>>>> hugepages from dump.
>>>>>> So with this OVS patch, if we catch private and shared hugepages,
>>>>>> "interesting" DPDK hugepages will get dumped, which is useful for
>>>>>> debugging post mortem.
>>>>>>
>>>>>> Adding Maxime, who will have a better idea of what is possible for the
>>>>>> guest mapping part.
>>>>>>
>>>>>>
>>>>>
>>>>> I wonder if we could do a MADV_DONTDUMP on all the guest memory at mmap
>>>>> time, then there are two cases:
>>>>>   a. vIOMMU = OFF. In this case we could do MADV_DODUMP on virtqueues
>>>>> memory. Doing so, we would have the rings memory, but not their buffers
>>>>> (except if they are located on same hugepages).
>>>>>   b. vIOMMU = ON. In this case we could do MADV_DODUMP on IOTLB_UPDATE
>>>>> new entries and MADV_DONTDUMP on invalidated entries. Doing so we will
>>>>> get both vrings and their buffers the backend is allowed to access.
>>>>
>>>> I guess, while DONTDUMP calls are mainly harmless, the explicit DODUMP
>>>> will override whatever user had in their global configuration.  Meaning
>>>> every DPDK application with vhost ports will start dumping some of the
>>>> guest pages with no actual ability to turn that off.
>>>
>>> I initially thought it would work that way, but the DODUMP flag just
>>> disables the DONTDUMP flag.
>>>
>>> https://github.com/torvalds/linux/blob/master/mm/madvise.c#L1055
>>> https://github.com/torvalds/linux/blob/master/fs/coredump.c#L1033
>>
>> Hmm, interesting.  Makes sense.
>>
>> Thanks for the pointers!
>>
>> So, it should still be 7f regardless in the coredump filter for OVS, right?
>> Do you plan to update the current patch or do you think we should omit
>> shared pages until support for MADV_DO/DONTDUMP is added to vhost library?
>>
>> Note that this will likely not be available in 22.11 as it's not a bug fix.
>> So, 23.11 at the earliest.
>>
>> Basically 2 options:
>>
>> 1. 0x3f and not having shared pages.  Flip to 0x7f with DPDK 23.11 next year.
>>    Pros: Smaller files
>>    Cons: Missing some of the virtqueue memory until [potentially] DPDK 23.11.
>>
>> 2. 0x7f today.
>>    Pros: All the memory is available.
>>    Cons: [Significantly] larger files until [potentially] DPDK 23.11.
>>
>> What do you think?  David, Maxime?
> 
> I'd prefer 7f today. It's disabled by default, has zero impact on end
> users, makes setting up debugging environments more convenient, and on
> distributions with systemd the larger coredumps are managed somewhat
> automatically. The news item already warns about large coredumps.
> 
> WDYT?

Sounds good to me.

> 
> -M
> 
>>>
>>> Cheers,
>>> M
>>>
>>>>
>>>> Can the behavior be configurable?
>>>>
>>>>>
>>>>> I can prepare a PoC quickly if someone is willing to experiment.
>>>>>
>>>>> Regards,
>>>>> Maxime
>>>>>
>>>>>
>>>>
>>>
>>
>
David Marchand Dec. 5, 2022, 12:52 p.m. UTC | #18
On Fri, Dec 2, 2022 at 9:14 PM Mike Pattrick <mkp@redhat.com> wrote:
>
> On Fri, Dec 2, 2022 at 1:40 PM Ilya Maximets <i.maximets@ovn.org> wrote:
> >
> > On 12/2/22 18:59, Mike Pattrick wrote:
> > > On Fri, Dec 2, 2022 at 11:59 AM Ilya Maximets <i.maximets@ovn.org> wrote:
> > >>
> > >> On 12/2/22 11:36, Maxime Coquelin wrote:
> > >>>
> > >>>
> > >>> On 12/2/22 11:09, David Marchand wrote:
> > >>>> On Wed, Nov 30, 2022 at 9:30 PM Ilya Maximets <i.maximets@ovn.org> wrote:
> > >>>>>>>>> Shouldn't this be 0x7f instead?
> > >>>>>>>>> 0x3f doesn't enable bit #6, which is responsible for dumping
> > >>>>>>>>> shared huge pages.  Or am I missing something?
> > >>>>>>>>
> > >>>>>>>> That's a good point, the hugepage may or may not be private. I'll send
> > >>>>>>>> in a new one.
> > >>>>>>>
> > >>>>>>> OK.  One thing to think about though is that we'll grab
> > >>>>>>> VM memory, I guess, in case we have vhost-user ports.
> > >>>>>>> So, the core dump size can become insanely huge.
> > >>>>>>>
> > >>>>>>> The downside of not having them is inability to inspect
> > >>>>>>> virtqueues and stuff in the dump.
> > >>>>>>
> > >>>>>> Did you consider madvise()?
> > >>>>>>
> > >>>>>>         MADV_DONTDUMP (since Linux 3.4)
> > >>>>>>                Exclude from a core dump those pages in the range
> > >>>>>> specified by addr and length.  This is useful in applications that
> > >>>>>> have large areas of memory that are known not to be useful in a core
> > >>>>>> dump.  The effect of  MADV_DONT‐
> > >>>>>>                DUMP takes precedence over the bit mask that is set via
> > >>>>>> the /proc/[pid]/coredump_filter file (see core(5)).
> > >>>>>>
> > >>>>>>         MADV_DODUMP (since Linux 3.4)
> > >>>>>>                Undo the effect of an earlier MADV_DONTDUMP.
> > >>>>>
> > >>>>> I don't think OVS actually knows location of particular VM memory
> > >>>>> pages that we do not need.  And dumping virtqueues and stuff is,
> > >>>>> probably, the point of this patch (?).
> > >>>>>
> > >>>>> vhost-user library might have a better idea on which particular parts
> > >>>>> of the memory guest may use for virtqueues and buffers, but I'm not
> > >>>>> 100% sure.
> > >>>>
> > >>>> Yes, distinguishing hugepages of interest is a problem.
> > >>>>
> > >>>> Since v20.05, DPDK mem allocator takes care of excluding (unused)
> > >>>> hugepages from dump.
> > >>>> So with this OVS patch, if we catch private and shared hugepages,
> > >>>> "interesting" DPDK hugepages will get dumped, which is useful for
> > >>>> debugging post mortem.
> > >>>>
> > >>>> Adding Maxime, who will have a better idea of what is possible for the
> > >>>> guest mapping part.
> > >>>>
> > >>>>
> > >>>
> > >>> I wonder if we could do a MADV_DONTDUMP on all the guest memory at mmap
> > >>> time, then there are two cases:
> > >>>   a. vIOMMU = OFF. In this case we could do MADV_DODUMP on virtqueues
> > >>> memory. Doing so, we would have the rings memory, but not their buffers
> > >>> (except if they are located on same hugepages).
> > >>>   b. vIOMMU = ON. In this case we could do MADV_DODUMP on IOTLB_UPDATE
> > >>> new entries and MADV_DONTDUMP on invalidated entries. Doing so we will
> > >>> get both vrings and their buffers the backend is allowed to access.
> > >>
> > >> I guess, while DONTDUMP calls are mainly harmless, the explicit DODUMP
> > >> will override whatever user had in their global configuration.  Meaning
> > >> every DPDK application with vhost ports will start dumping some of the
> > >> guest pages with no actual ability to turn that off.
> > >
> > > I initially thought it would work that way, but the DODUMP flag just
> > > disables the DONTDUMP flag.
> > >
> > > https://github.com/torvalds/linux/blob/master/mm/madvise.c#L1055
> > > https://github.com/torvalds/linux/blob/master/fs/coredump.c#L1033
> >
> > Hmm, interesting.  Makes sense.
> >
> > Thanks for the pointers!
> >
> > So, it should still be 7f regardless in the coredump filter for OVS, right?
> > Do you plan to update the current patch or do you think we should omit
> > shared pages until support for MADV_DO/DONTDUMP is added to vhost library?
> >
> > Note that this will likely not be available in 22.11 as it's not a bug fix.
> > So, 23.11 at the earliest.
> >
> > Basically 2 options:
> >
> > 1. 0x3f and not having shared pages.  Flip to 0x7f with DPDK 23.11 next year.
> >    Pros: Smaller files
> >    Cons: Missing some of the virtqueue memory until [potentially] DPDK 23.11.

Mm, if someone still has some --socket-mem config, then I guess shared
hugepages will be in use in DPDK.
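
(A rough way to check what a given deployment actually maps, assuming the
Shared_Hugetlb/Private_Hugetlb fields exposed in smaps by recent kernels;
purely a diagnostic sketch, not related to the patch:)

    pid=$(pidof ovs-vswitchd)
    # Sum hugetlb usage across mappings; a non-zero Shared_Hugetlb total
    # means bit 6 (0x7f vs 0x3f) makes a difference for the dump.
    awk '/^(Shared|Private)_Hugetlb:/ {sum[$1] += $2}
         END {for (k in sum) print k, sum[k], "kB"}' /proc/$pid/smaps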


> >
> > 2. 0x7f today.
> >    Pros: All the memory is available.
> >    Cons: [Significantly] larger files until [potentially] DPDK 23.11.
> >
> > What do you think?  David, Maxime?
>
> I'd prefer 7f today. It's disabled by default, has zero impact on end
> users, makes setting up debugging environments more convenient, and on
> distributions with systemd the larger coredumps are managed somewhat
> automatically. The news item already warns about large coredumps.

I prefer the latter suggestion.
Ilya Maximets Dec. 5, 2022, 12:54 p.m. UTC | #19
On 12/5/22 13:46, David Marchand wrote:
> On Fri, Dec 2, 2022 at 7:00 PM Mike Pattrick <mkp@redhat.com> wrote:
>>>>>>> Did you consider madvise()?
>>>>>>>
>>>>>>>         MADV_DONTDUMP (since Linux 3.4)
>>>>>>>                Exclude from a core dump those pages in the range
>>>>>>> specified by addr and length.  This is useful in applications that
>>>>>>> have large areas of memory that are known not to be useful in a core
>>>>>>> dump.  The effect of  MADV_DONT‐
>>>>>>>                DUMP takes precedence over the bit mask that is set via
>>>>>>> the /proc/[pid]/coredump_filter file (see core(5)).
>>>>>>>
>>>>>>>         MADV_DODUMP (since Linux 3.4)
>>>>>>>                Undo the effect of an earlier MADV_DONTDUMP.
>>>>>>
>>> I guess, while DONTDUMP calls are mainly harmless, the explicit DODUMP
>>> will override whatever user had in their global configuration.  Meaning
>>> every DPDK application with vhost ports will start dumping some of the
>>> guest pages with no actual ability to turn that off.
>>
>> I initially thought it would work that way, but the DODUMP flag just
>> disables the DONTDUMP flag.
>>
>> https://github.com/torvalds/linux/blob/master/mm/madvise.c#L1055
>> https://github.com/torvalds/linux/blob/master/fs/coredump.c#L1033
>>
> 
> Glad to read that the manual tells the same story as the kernel code :-).

Manuals, pfff.  I seem to automatically skip them even if quoted directly
in the thread. :D

Patch

diff --git a/NEWS b/NEWS
index 8fa57836a..7af60dce3 100644
--- a/NEWS
+++ b/NEWS
@@ -3,6 +3,10 @@  Post-v2.17.0
    - OVSDB:
      * 'relay' service model now supports transaction history, i.e. honors the
        'last-txn-id' field in 'monitor_cond_since' requests from clients.
+   - ovs-ctl:
+     * New option '--dump-hugepages' to include hugepages in core dumps. This
+       can assist with postmortem analysis involving DPDK, but may also produce
+       significantly larger core dump files.
 
 
 v2.17.0 - 17 Feb 2022
diff --git a/utilities/ovs-ctl.in b/utilities/ovs-ctl.in
index e6e07f476..8f900314b 100644
--- a/utilities/ovs-ctl.in
+++ b/utilities/ovs-ctl.in
@@ -103,8 +103,13 @@  set_system_ids () {
     action "Configuring Open vSwitch system IDs" "$@" $extra_ids
 }
 
-check_force_cores () {
-    if test X"$FORCE_COREFILES" = Xyes; then
+check_core_config () {
+    if test X"$DUMP_HUGEPAGES" = Xyes; then
+        echo 0x3f > /proc/self/coredump_filter
+        if test X"$FORCE_COREFILES" = Xyes; then
+            ulimit -c unlimited
+        fi
+    elif test X"$FORCE_COREFILES" = Xyes; then
         ulimit -c 67108864
     fi
 }
@@ -116,7 +121,7 @@  del_transient_ports () {
 }
 
 do_start_ovsdb () {
-    check_force_cores
+    check_core_config
 
     if daemon_is_running ovsdb-server; then
         log_success_msg "ovsdb-server is already running"
@@ -193,7 +198,7 @@  add_managers () {
 }
 
 do_start_forwarding () {
-    check_force_cores
+    check_core_config
 
     insert_mod_if_required || return 1
 
@@ -330,6 +335,7 @@  set_defaults () {
 
     DAEMON_CWD=/
     FORCE_COREFILES=yes
+    DUMP_HUGEPAGES=no
     MLOCKALL=yes
     SELF_CONFINEMENT=yes
     MONITOR=yes
@@ -419,6 +425,7 @@  Other important options for "start", "restart" and "force-reload-kmod":
 Less important options for "start", "restart" and "force-reload-kmod":
   --daemon-cwd=DIR               set working dir for OVS daemons (default: $DAEMON_CWD)
   --no-force-corefiles           do not force on core dumps for OVS daemons
+  --dump-hugepages               include hugepages in coredumps
   --no-mlockall                  do not lock all of ovs-vswitchd into memory
   --ovsdb-server-priority=NICE   set ovsdb-server's niceness (default: $OVSDB_SERVER_PRIORITY)
   --ovsdb-server-options=OPTIONS additional options for ovsdb-server (example: '-vconsole:dbg -vfile:dbg')
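
A minimal usage sketch with the patch applied (the invocation below is
illustrative; the daemons started by ovs-ctl inherit the coredump filter set
in check_core_config):

    # opt in to hugepage dumping when starting OVS
    ovs-ctl --dump-hugepages start

    # verify the filter picked up by the running daemon
    cat /proc/$(pidof ovs-vswitchd)/coredump_filter
    # -> 0000003f with this v1 (0x7f per the discussion above)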