| Message ID | 20220325221712.567255-1-mkp@redhat.com |
|---|---|
| State | Changes Requested |
| Series | [ovs-dev,v1] ovs-ctl: Allow inclusion of hugepages in coredumps |

| Context | Check | Description |
|---|---|---|
| ovsrobot/apply-robot | success | apply and check: success |
| ovsrobot/github-robot-_Build_and_Test | success | github build: passed |
| ovsrobot/intel-ovs-compilation | success | test: success |
Hi Mike,

Sorry it took that long to review this patch.

On 3/25/22 23:17, Mike Pattrick wrote:
> Add new option --dump-hugepages option in ovs-ctl to enable the addition
> of hugepages in the core dump filter.
>
> Signed-off-by: Mike Pattrick <mkp@redhat.com>
> ---
>  NEWS                 |  4 ++++
>  utilities/ovs-ctl.in | 15 +++++++++++----
>  2 files changed, 15 insertions(+), 4 deletions(-)
>
> diff --git a/NEWS b/NEWS
> index 8fa57836a..7af60dce3 100644
> --- a/NEWS
> +++ b/NEWS
> @@ -3,6 +3,10 @@ Post-v2.17.0
>     - OVSDB:
>       * 'relay' service model now supports transaction history, i.e. honors the
>         'last-txn-id' field in 'monitor_cond_since' requests from clients.
> +   - ovs-ctl:
> +     * New option '--dump-hugepages' to include hugepages in core dumps. This
> +       can assist with postmortem analysis involving DPDK, but may also produce
> +       significantly larger core dump files.
>

I'm afraid this part needs rebasing.

>  v2.17.0 - 17 Feb 2022
> diff --git a/utilities/ovs-ctl.in b/utilities/ovs-ctl.in
> index e6e07f476..8f900314b 100644
> --- a/utilities/ovs-ctl.in
> +++ b/utilities/ovs-ctl.in
> @@ -103,8 +103,13 @@ set_system_ids () {
>      action "Configuring Open vSwitch system IDs" "$@" $extra_ids
>  }
>
> -check_force_cores () {
> -    if test X"$FORCE_COREFILES" = Xyes; then
> +check_core_config () {
> +    if test X"$DUMP_HUGEPAGES" = Xyes; then
> +        echo 0x3f > /proc/self/coredump_filter
> +        if test X"$FORCE_COREFILES" = Xyes; then
> +            ulimit -c unlimited
> +        fi
> +    elif test X"$FORCE_COREFILES" = Xyes; then
>          ulimit -c 67108864
>      fi
>  }
>
> @@ -116,7 +121,7 @@ del_transient_ports () {
>  }
>
>  do_start_ovsdb () {
> -    check_force_cores
> +    check_core_config
>
>      if daemon_is_running ovsdb-server; then
>          log_success_msg "ovsdb-server is already running"
> @@ -193,7 +198,7 @@ add_managers () {
>  }
>
>  do_start_forwarding () {
> -    check_force_cores
> +    check_core_config
>
>      insert_mod_if_required || return 1
>
> @@ -330,6 +335,7 @@ set_defaults () {
>
>      DAEMON_CWD=/
>      FORCE_COREFILES=yes
> +    DUMP_HUGEPAGES=no
>      MLOCKALL=yes
>      SELF_CONFINEMENT=yes
>      MONITOR=yes
> @@ -419,6 +425,7 @@ Other important options for "start", "restart" and "force-reload-kmod":
>  Less important options for "start", "restart" and "force-reload-kmod":
>    --daemon-cwd=DIR               set working dir for OVS daemons (default: $DAEMON_CWD)
>    --no-force-corefiles           do not force on core dumps for OVS daemons
> +  --dump-hugepages               include hugepages in coredumps
>    --no-mlockall                  do not lock all of ovs-vswitchd into memory
>    --ovsdb-server-priority=NICE   set ovsdb-server's niceness (default: $OVSDB_SERVER_PRIORITY)
>    --ovsdb-server-options=OPTIONS additional options for ovsdb-server (example: '-vconsole:dbg -vfile:dbg')
>

Tested locally and verified that with the option hugepages appear in coredumps.
Apart from the need to rebase the NEWS, the patch looks good to me.

Acked-by: Adrian Moreno <amorenoz@redhat.com>

--
Adrián Moreno
On 11/25/22 18:19, Adrian Moreno wrote:
> Hi Mike,
>
> Sorry it took that long to review this patch.
>
> On 3/25/22 23:17, Mike Pattrick wrote:
>> Add new option --dump-hugepages option in ovs-ctl to enable the addition
>> of hugepages in the core dump filter.
>>
>> Signed-off-by: Mike Pattrick <mkp@redhat.com>
>> ---
>> [...]
>> +check_core_config () {
>> +    if test X"$DUMP_HUGEPAGES" = Xyes; then
>> +        echo 0x3f > /proc/self/coredump_filter

Shouldn't this be 0x7f instead?
0x3f doesn't enable bit #6, which is responsible for dumping
shared huge pages.  Or am I missing something?

>> [...]
>
> Tested locally and verified that with the option hugepages appear in coredumps.
> Apart from the need to rebase the NEWS, the patch looks good to me.
>
> Acked-by: Adrian Moreno <amorenoz@redhat.com>

Best regards, Ilya Maximets.
On Wed, Nov 30, 2022 at 7:27 AM Ilya Maximets <i.maximets@ovn.org> wrote:
>
> On 11/25/22 18:19, Adrian Moreno wrote:
> > [...]
> > On 3/25/22 23:17, Mike Pattrick wrote:
> >> [...]
> >> +check_core_config () {
> >> +    if test X"$DUMP_HUGEPAGES" = Xyes; then
> >> +        echo 0x3f > /proc/self/coredump_filter
>
> Shouldn't this be 0x7f instead?
> 0x3f doesn't enable bit #6, which is responsible for dumping
> shared huge pages.  Or am I missing something?

That's a good point, the hugepage may or may not be private. I'll send
in a new one.

Cheers,
M

> Best regards, Ilya Maximets.
On 11/30/22 20:11, Mike Pattrick wrote:
> On Wed, Nov 30, 2022 at 7:27 AM Ilya Maximets <i.maximets@ovn.org> wrote:
>> [...]
>> Shouldn't this be 0x7f instead?
>> 0x3f doesn't enable bit #6, which is responsible for dumping
>> shared huge pages.  Or am I missing something?
>
> That's a good point, the hugepage may or may not be private. I'll send
> in a new one.

OK.  One thing to think about though is that we'll grab
VM memory, I guess, in case we have vhost-user ports.
So, the core dump size can become insanely huge.

The downside of not having them is inability to inspect
virtqueues and stuff in the dump.

> Cheers,
> M
>
>> Best regards, Ilya Maximets.
>> [...]
On Wed, Nov 30, 2022 at 8:46 PM Ilya Maximets <i.maximets@ovn.org> wrote:
>
> On 11/30/22 20:11, Mike Pattrick wrote:
> > [...]
> > That's a good point, the hugepage may or may not be private. I'll send
> > in a new one.
>
> OK.  One thing to think about though is that we'll grab
> VM memory, I guess, in case we have vhost-user ports.
> So, the core dump size can become insanely huge.
>
> The downside of not having them is inability to inspect
> virtqueues and stuff in the dump.

Did you consider madvise()?

       MADV_DONTDUMP (since Linux 3.4)
              Exclude from a core dump those pages in the range specified by
              addr and length.  This is useful in applications that have
              large areas of memory that are known not to be useful in a
              core dump.  The effect of MADV_DONTDUMP takes precedence over
              the bit mask that is set via the /proc/[pid]/coredump_filter
              file (see core(5)).

       MADV_DODUMP (since Linux 3.4)
              Undo the effect of an earlier MADV_DONTDUMP.
On 11/30/22 20:52, David Marchand wrote:
> On Wed, Nov 30, 2022 at 8:46 PM Ilya Maximets <i.maximets@ovn.org> wrote:
>> [...]
>> OK.  One thing to think about though is that we'll grab
>> VM memory, I guess, in case we have vhost-user ports.
>> So, the core dump size can become insanely huge.
>>
>> The downside of not having them is inability to inspect
>> virtqueues and stuff in the dump.
>
> Did you consider madvise()?
>
>        MADV_DONTDUMP (since Linux 3.4)
>               Exclude from a core dump those pages in the range specified
>               by addr and length.  [...]  The effect of MADV_DONTDUMP
>               takes precedence over the bit mask that is set via the
>               /proc/[pid]/coredump_filter file (see core(5)).
>
>        MADV_DODUMP (since Linux 3.4)
>               Undo the effect of an earlier MADV_DONTDUMP.

I don't think OVS actually knows location of particular VM memory
pages that we do not need.  And dumping virtqueues and stuff is,
probably, the point of this patch (?).

vhost-user library might have a better idea on which particular parts
of the memory guest may use for virtqueues and buffers, but I'm not
100% sure.

Best regards, Ilya Maximets.
On Wed, Nov 30, 2022 at 9:30 PM Ilya Maximets <i.maximets@ovn.org> wrote:
> [...]
> I don't think OVS actually knows location of particular VM memory
> pages that we do not need.  And dumping virtqueues and stuff is,
> probably, the point of this patch (?).
>
> vhost-user library might have a better idea on which particular parts
> of the memory guest may use for virtqueues and buffers, but I'm not
> 100% sure.

Yes, distinguishing hugepages of interest is a problem.

Since v20.05, DPDK mem allocator takes care of excluding (unused)
hugepages from dump.
So with this OVS patch, if we catch private and shared hugepages,
"interesting" DPDK hugepages will get dumped, which is useful for
debugging post mortem.

Adding Maxime, who will have a better idea of what is possible for the
guest mapping part.
On 12/2/22 11:09, David Marchand wrote:
> On Wed, Nov 30, 2022 at 9:30 PM Ilya Maximets <i.maximets@ovn.org> wrote:
>> [...]
>> vhost-user library might have a better idea on which particular parts
>> of the memory guest may use for virtqueues and buffers, but I'm not
>> 100% sure.
>
> Yes, distinguishing hugepages of interest is a problem.
>
> Since v20.05, DPDK mem allocator takes care of excluding (unused)
> hugepages from dump.
> So with this OVS patch, if we catch private and shared hugepages,
> "interesting" DPDK hugepages will get dumped, which is useful for
> debugging post mortem.
>
> Adding Maxime, who will have a better idea of what is possible for the
> guest mapping part.

I wonder if we could do a MADV_DONTDUMP on all the guest memory at mmap
time, then there are two cases:
a. vIOMMU = OFF. In this case we could do MADV_DODUMP on virtqueues
memory. Doing so, we would have the rings memory, but not their buffers
(except if they are located on same hugepages).
b. vIOMMU = ON. In this case we could do MADV_DODUMP on IOTLB_UPDATE
new entries and MADV_DONTDUMP on invalidated entries. Doing so we will
get both vrings and their buffers the backend is allowed to access.

I can prepare a PoC quickly if someone is willing to experiment.

Regards,
Maxime
On Fri, Dec 2, 2022 at 5:37 AM Maxime Coquelin
<maxime.coquelin@redhat.com> wrote:
> [...]
> I wonder if we could do a MADV_DONTDUMP on all the guest memory at mmap
> time, then there are two cases:
> a. vIOMMU = OFF. In this case we could do MADV_DODUMP on virtqueues
> memory. Doing so, we would have the rings memory, but not their buffers
> (except if they are located on same hugepages).
> b. vIOMMU = ON. In this case we could do MADV_DODUMP on IOTLB_UPDATE
> new entries and MADV_DONTDUMP on invalidated entries. Doing so we will
> get both vrings and their buffers the backend is allowed to access.
>
> I can prepare a PoC quickly if someone is willing to experiment.

A big motivation for this patch is DPDK becomes a big black hole in
coredumps, preventing even netdev structures from being enumerated.

I threw together a small PoC for this yesterday, only marking vhost
mmaps as DONTDUMP. The result was that a simple OVS configuration with
one vhost attached to an 8gb VM dropped the zstd compressed coredump
including shared huge pages from 486.4M to 19.8M. The resulting
coredump was also significantly more debuggable, all internal OVS
datastructures became available as well as many DPDK ones.

I'll look at cleaning things up and submitting to the DPDK mailing list.

Cheers,
M

> Regards,
> Maxime
On 12/2/22 15:59, Mike Pattrick wrote:
> [...]
> I threw together a small PoC for this yesterday, only marking vhost
> mmaps as DONTDUMP. The result was that a simple OVS configuration with
> one vhost attached to an 8gb VM dropped the zstd compressed coredump
> including shared huge pages from 486.4M to 19.8M. The resulting
> coredump was also significantly more debuggable, all internal OVS
> datastructures became available as well as many DPDK ones.
>
> I'll look at cleaning things up and submitting to the DPDK mailing list.

Thanks for doing the PoC, that will definitely help debugging!

I have quickly drafted a patch doing the MADV_DODUMP on the area of
interest for Vhost as described in my initial reply, could you have a
try with it? (disclaimer: not even build tested)

Maxime

=========================================================================

diff --git a/lib/vhost/iotlb.c b/lib/vhost/iotlb.c
index 6a729e8804..555b4ebba0 100644
--- a/lib/vhost/iotlb.c
+++ b/lib/vhost/iotlb.c
@@ -149,6 +149,7 @@ vhost_user_iotlb_cache_remove_all(struct vhost_virtqueue *vq)
 	rte_rwlock_write_lock(&vq->iotlb_lock);

 	RTE_TAILQ_FOREACH_SAFE(node, &vq->iotlb_list, next, temp_node) {
+		madvise(node->uaddr, node->size, MADV_DODUMP);
 		TAILQ_REMOVE(&vq->iotlb_list, node, next);
 		vhost_user_iotlb_pool_put(vq, node);
 	}
@@ -170,6 +171,7 @@ vhost_user_iotlb_cache_random_evict(struct vhost_virtqueue *vq)

 	RTE_TAILQ_FOREACH_SAFE(node, &vq->iotlb_list, next, temp_node) {
 		if (!entry_idx) {
+			madvise(node->uaddr, node->size, MADV_DODUMP);
 			TAILQ_REMOVE(&vq->iotlb_list, node, next);
 			vhost_user_iotlb_pool_put(vq, node);
 			vq->iotlb_cache_nr--;
@@ -222,12 +224,14 @@ vhost_user_iotlb_cache_insert(struct virtio_net *dev, struct vhost_virtqueue *vq
 			vhost_user_iotlb_pool_put(vq, new_node);
 			goto unlock;
 		} else if (node->iova > new_node->iova) {
+			madvise(new_node->uaddr, new_node->size, MADV_DODUMP);
 			TAILQ_INSERT_BEFORE(node, new_node, next);
 			vq->iotlb_cache_nr++;
 			goto unlock;
 		}
 	}

+	madvise(new_node->uaddr, new_node->size, MADV_DODUMP);
 	TAILQ_INSERT_TAIL(&vq->iotlb_list, new_node, next);
 	vq->iotlb_cache_nr++;

@@ -255,6 +259,7 @@ vhost_user_iotlb_cache_remove(struct vhost_virtqueue *vq,
 			break;

 		if (iova < node->iova + node->size) {
+			madvise(node->uaddr, node->size, MADV_DODUMP);
 			TAILQ_REMOVE(&vq->iotlb_list, node, next);
 			vhost_user_iotlb_pool_put(vq, node);
 			vq->iotlb_cache_nr--;
diff --git a/lib/vhost/vhost_user.c b/lib/vhost/vhost_user.c
index 371d6304d6..208377ddc6 100644
--- a/lib/vhost/vhost_user.c
+++ b/lib/vhost/vhost_user.c
@@ -767,6 +767,8 @@ translate_ring_addresses(struct virtio_net **pdev, struct vhost_virtqueue **pvq)
 			return;
 		}

+		madvise(vq->desc_packed, len, MADV_DODUMP);
+
 		numa_realloc(&dev, &vq);
 		*pdev = dev;
 		*pvq = vq;
@@ -782,6 +784,8 @@ translate_ring_addresses(struct virtio_net **pdev, struct vhost_virtqueue **pvq)
 			return;
 		}

+		madvise(vq->driver_event, len, MADV_DODUMP);
+
 		len = sizeof(struct vring_packed_desc_event);
 		vq->device_event = (struct vring_packed_desc_event *)
 					(uintptr_t)ring_addr_to_vva(dev,
@@ -793,6 +797,8 @@ translate_ring_addresses(struct virtio_net **pdev, struct vhost_virtqueue **pvq)
 			return;
 		}

+		madvise(vq->device_event, len, MADV_DODUMP);
+
 		vq->access_ok = true;
 		return;
 	}
@@ -809,6 +815,8 @@ translate_ring_addresses(struct virtio_net **pdev, struct vhost_virtqueue **pvq)
 		return;
 	}

+	madvise(vq->desc, len, MADV_DODUMP);
+
 	numa_realloc(&dev, &vq);
 	*pdev = dev;
 	*pvq = vq;
@@ -824,6 +832,8 @@ translate_ring_addresses(struct virtio_net **pdev, struct vhost_virtqueue **pvq)
 		return;
 	}

+	madvise(vq->avail, len, MADV_DODUMP);
+
 	len = sizeof(struct vring_used) +
 		sizeof(struct vring_used_elem) * vq->size;
 	if (dev->features & (1ULL << VIRTIO_RING_F_EVENT_IDX))
@@ -836,6 +846,8 @@ translate_ring_addresses(struct virtio_net **pdev, struct vhost_virtqueue **pvq)
 		return;
 	}

+	madvise(vq->used, len, MADV_DODUMP);
+
 	if (vq->last_used_idx != vq->used->idx) {
 		VHOST_LOG_CONFIG(dev->ifname, WARNING,
 			"last_used_idx (%u) and vq->used->idx (%u) mismatches;\n",
On Fri, Dec 2, 2022 at 10:20 AM Maxime Coquelin <maxime.coquelin@redhat.com> wrote: > > > > On 12/2/22 15:59, Mike Pattrick wrote: > > On Fri, Dec 2, 2022 at 5:37 AM Maxime Coquelin > > <maxime.coquelin@redhat.com> wrote: > >> > >> > >> > >> On 12/2/22 11:09, David Marchand wrote: > >>> On Wed, Nov 30, 2022 at 9:30 PM Ilya Maximets <i.maximets@ovn.org> wrote: > >>>>>>>> Shouldn't this be 0x7f instead? > >>>>>>>> 0x3f doesn't enable bit #6, which is responsible for dumping > >>>>>>>> shared huge pages. Or am I missing something? > >>>>>>> > >>>>>>> That's a good point, the hugepage may or may not be private. I'll send > >>>>>>> in a new one. > >>>>>> > >>>>>> OK. One thing to think about though is that we'll grab > >>>>>> VM memory, I guess, in case we have vhost-user ports. > >>>>>> So, the core dump size can become insanely huge. > >>>>>> > >>>>>> The downside of not having them is inability to inspect > >>>>>> virtqueues and stuff in the dump. > >>>>> > >>>>> Did you consider madvise()? > >>>>> > >>>>> MADV_DONTDUMP (since Linux 3.4) > >>>>> Exclude from a core dump those pages in the range > >>>>> specified by addr and length. This is useful in applications that > >>>>> have large areas of memory that are known not to be useful in a core > >>>>> dump. The effect of MADV_DONT‐ > >>>>> DUMP takes precedence over the bit mask that is set via > >>>>> the /proc/[pid]/coredump_filter file (see core(5)). > >>>>> > >>>>> MADV_DODUMP (since Linux 3.4) > >>>>> Undo the effect of an earlier MADV_DONTDUMP. > >>>> > >>>> I don't think OVS actually knows location of particular VM memory > >>>> pages that we do not need. And dumping virtqueues and stuff is, > >>>> probably, the point of this patch (?). > >>>> > >>>> vhost-user library might have a better idea on which particular parts > >>>> of the memory guest may use for virtqueues and buffers, but I'm not > >>>> 100% sure. > >>> > >>> Yes, distinguishing hugepages of interest is a problem. 
> >>> > >>> Since v20.05, DPDK mem allocator takes care of excluding (unused) > >>> hugepages from dump. > >>> So with this OVS patch, if we catch private and shared hugepages, > >>> "interesting" DPDK hugepages will get dumped, which is useful for > >>> debugging post mortem. > >>> > >>> Adding Maxime, who will have a better idea of what is possible for the > >>> guest mapping part. > >>> > >>> > >> > >> I wonder if we could do a MADV_DONTDUMP on all the guest memory at mmap > >> time, then there are two cases: > >> a. vIOMMU = OFF. In this case we could do MADV_DODUMP on virtqueues > >> memory. Doing so, we would have the rings memory, but not their buffers > >> (except if they are located on same hugepages). > >> b. vIOMMU = ON. In this case we could do MADV_DODUMP on IOTLB_UPDATE > >> new entries and MADV_DONTDUMP on invalidated entries. Doing so we will > >> get both vrings and their buffers the backend is allowed to access. > >> > >> I can prepare a PoC quickly if someone is willing to experiment. > > > > A big motivation for this patch is DPDK becomes a big black hole in > > coredumps, preventing even netdev structures from being enumerated. > > > > I threw together a small PoC for this yesterday, only marking vhost > > mmaps as DONTDUMP. The result was that a simple OVS configuration with > > one vhost attached to an 8gb VM dropped the zstd compressed coredump > > including shared huge pages from 486.4M to 19.8M. The resulting > > coredump was also significantly more debuggable, all internal OVS > > datastructures became available as well as many DPDK ones. > > > > I'll look at cleaning things up and submitting to the DPDK mailing list. > > Thanks for doing the PoC, that will definitely help debugging! > > I have quickly drafted a patch doing the MADV_DODUMP on the area of > interest for Vhost as described in my initial reply, could you have a > try with it? 
> (disclaimer: not even build tested)

Dump with shared huge pages:
  Thu 2022-12-01 13:07:33 EST  213904  0  0  SIGSEGV  present  /usr/sbin/ovs-vswitchd  486.4M
  uncompressed size: 15,047,430,144

Dump with vhost mmaps set to dontdump:
  Thu 2022-12-01 13:23:02 EST  217253  0  0  SIGSEGV  present  /usr/sbin/ovs-vswitchd  19.8M
  uncompressed size: 4,310,011,904

Dump with Maxime's patch:
  Fri 2022-12-02 11:34:16 EST  232661  0  0  SIGSEGV  present  /usr/sbin/ovs-vswitchd  19.7M
  uncompressed size: 4,310,007,808

Not sure where the 4 kB went from yesterday to today.

-M

> Maxime
>
> =========================================================================
>
> [Maxime's draft patch snipped; quoted in full in the previous message]
On 12/2/22 11:36, Maxime Coquelin wrote: > > > On 12/2/22 11:09, David Marchand wrote: >> On Wed, Nov 30, 2022 at 9:30 PM Ilya Maximets <i.maximets@ovn.org> wrote: >>>>>>> Shouldn't this be 0x7f instead? >>>>>>> 0x3f doesn't enable bit #6, which is responsible for dumping >>>>>>> shared huge pages. Or am I missing something? >>>>>> >>>>>> That's a good point, the hugepage may or may not be private. I'll send >>>>>> in a new one. >>>>> >>>>> OK. One thing to think about though is that we'll grab >>>>> VM memory, I guess, in case we have vhost-user ports. >>>>> So, the core dump size can become insanely huge. >>>>> >>>>> The downside of not having them is inability to inspect >>>>> virtqueues and stuff in the dump. >>>> >>>> Did you consider madvise()? >>>> >>>> MADV_DONTDUMP (since Linux 3.4) >>>> Exclude from a core dump those pages in the range >>>> specified by addr and length. This is useful in applications that >>>> have large areas of memory that are known not to be useful in a core >>>> dump. The effect of MADV_DONT‐ >>>> DUMP takes precedence over the bit mask that is set via >>>> the /proc/[pid]/coredump_filter file (see core(5)). >>>> >>>> MADV_DODUMP (since Linux 3.4) >>>> Undo the effect of an earlier MADV_DONTDUMP. >>> >>> I don't think OVS actually knows location of particular VM memory >>> pages that we do not need. And dumping virtqueues and stuff is, >>> probably, the point of this patch (?). >>> >>> vhost-user library might have a better idea on which particular parts >>> of the memory guest may use for virtqueues and buffers, but I'm not >>> 100% sure. >> >> Yes, distinguishing hugepages of interest is a problem. >> >> Since v20.05, DPDK mem allocator takes care of excluding (unused) >> hugepages from dump. >> So with this OVS patch, if we catch private and shared hugepages, >> "interesting" DPDK hugepages will get dumped, which is useful for >> debugging post mortem. 
>> >> Adding Maxime, who will have a better idea of what is possible for the >> guest mapping part. >> >> > > I wonder if we could do a MADV_DONTDUMP on all the guest memory at mmap > time, then there are two cases: > a. vIOMMU = OFF. In this case we could do MADV_DODUMP on virtqueues > memory. Doing so, we would have the rings memory, but not their buffers > (except if they are located on same hugepages). > b. vIOMMU = ON. In this case we could do MADV_DODUMP on IOTLB_UPDATE > new entries and MADV_DONTDUMP on invalidated entries. Doing so we will > get both vrings and their buffers the backend is allowed to access. I guess, while DONTDUMP calls are mainly harmless, the explicit DODUMP will override whatever user had in their global configuration. Meaning every DPDK application with vhost ports will start dumping some of the guest pages with no actual ability to turn that off. Can the behavior be configurable? > > I can prepare a PoC quickly if someone is willing to experiment. > > Regards, > Maxime > >
On Fri, Dec 2, 2022 at 11:59 AM Ilya Maximets <i.maximets@ovn.org> wrote: > > On 12/2/22 11:36, Maxime Coquelin wrote: > > > > > > On 12/2/22 11:09, David Marchand wrote: > >> On Wed, Nov 30, 2022 at 9:30 PM Ilya Maximets <i.maximets@ovn.org> wrote: > >>>>>>> Shouldn't this be 0x7f instead? > >>>>>>> 0x3f doesn't enable bit #6, which is responsible for dumping > >>>>>>> shared huge pages. Or am I missing something? > >>>>>> > >>>>>> That's a good point, the hugepage may or may not be private. I'll send > >>>>>> in a new one. > >>>>> > >>>>> OK. One thing to think about though is that we'll grab > >>>>> VM memory, I guess, in case we have vhost-user ports. > >>>>> So, the core dump size can become insanely huge. > >>>>> > >>>>> The downside of not having them is inability to inspect > >>>>> virtqueues and stuff in the dump. > >>>> > >>>> Did you consider madvise()? > >>>> > >>>> MADV_DONTDUMP (since Linux 3.4) > >>>> Exclude from a core dump those pages in the range > >>>> specified by addr and length. This is useful in applications that > >>>> have large areas of memory that are known not to be useful in a core > >>>> dump. The effect of MADV_DONT‐ > >>>> DUMP takes precedence over the bit mask that is set via > >>>> the /proc/[pid]/coredump_filter file (see core(5)). > >>>> > >>>> MADV_DODUMP (since Linux 3.4) > >>>> Undo the effect of an earlier MADV_DONTDUMP. > >>> > >>> I don't think OVS actually knows location of particular VM memory > >>> pages that we do not need. And dumping virtqueues and stuff is, > >>> probably, the point of this patch (?). > >>> > >>> vhost-user library might have a better idea on which particular parts > >>> of the memory guest may use for virtqueues and buffers, but I'm not > >>> 100% sure. > >> > >> Yes, distinguishing hugepages of interest is a problem. > >> > >> Since v20.05, DPDK mem allocator takes care of excluding (unused) > >> hugepages from dump. 
> >> So with this OVS patch, if we catch private and shared hugepages, > >> "interesting" DPDK hugepages will get dumped, which is useful for > >> debugging post mortem. > >> > >> Adding Maxime, who will have a better idea of what is possible for the > >> guest mapping part. > >> > >> > > > > I wonder if we could do a MADV_DONTDUMP on all the guest memory at mmap > > time, then there are two cases: > > a. vIOMMU = OFF. In this case we could do MADV_DODUMP on virtqueues > > memory. Doing so, we would have the rings memory, but not their buffers > > (except if they are located on same hugepages). > > b. vIOMMU = ON. In this case we could do MADV_DODUMP on IOTLB_UPDATE > > new entries and MADV_DONTDUMP on invalidated entries. Doing so we will > > get both vrings and their buffers the backend is allowed to access. > > I guess, while DONTDUMP calls are mainly harmless, the explicit DODUMP > will override whatever user had in their global configuration. Meaning > every DPDK application with vhost ports will start dumping some of the > guest pages with no actual ability to turn that off. I initially thought it would work that way, but the DODUMP flag just disables the DONTDUMP flag. https://github.com/torvalds/linux/blob/master/mm/madvise.c#L1055 https://github.com/torvalds/linux/blob/master/fs/coredump.c#L1033 Cheers, M > > Can the behavior be configurable? > > > > > I can prepare a PoC quickly if someone is willing to experiment. > > > > Regards, > > Maxime > > > > >
On 12/2/22 18:59, Mike Pattrick wrote: > On Fri, Dec 2, 2022 at 11:59 AM Ilya Maximets <i.maximets@ovn.org> wrote: >> >> On 12/2/22 11:36, Maxime Coquelin wrote: >>> >>> >>> On 12/2/22 11:09, David Marchand wrote: >>>> On Wed, Nov 30, 2022 at 9:30 PM Ilya Maximets <i.maximets@ovn.org> wrote: >>>>>>>>> Shouldn't this be 0x7f instead? >>>>>>>>> 0x3f doesn't enable bit #6, which is responsible for dumping >>>>>>>>> shared huge pages. Or am I missing something? >>>>>>>> >>>>>>>> That's a good point, the hugepage may or may not be private. I'll send >>>>>>>> in a new one. >>>>>>> >>>>>>> OK. One thing to think about though is that we'll grab >>>>>>> VM memory, I guess, in case we have vhost-user ports. >>>>>>> So, the core dump size can become insanely huge. >>>>>>> >>>>>>> The downside of not having them is inability to inspect >>>>>>> virtqueues and stuff in the dump. >>>>>> >>>>>> Did you consider madvise()? >>>>>> >>>>>> MADV_DONTDUMP (since Linux 3.4) >>>>>> Exclude from a core dump those pages in the range >>>>>> specified by addr and length. This is useful in applications that >>>>>> have large areas of memory that are known not to be useful in a core >>>>>> dump. The effect of MADV_DONT‐ >>>>>> DUMP takes precedence over the bit mask that is set via >>>>>> the /proc/[pid]/coredump_filter file (see core(5)). >>>>>> >>>>>> MADV_DODUMP (since Linux 3.4) >>>>>> Undo the effect of an earlier MADV_DONTDUMP. >>>>> >>>>> I don't think OVS actually knows location of particular VM memory >>>>> pages that we do not need. And dumping virtqueues and stuff is, >>>>> probably, the point of this patch (?). >>>>> >>>>> vhost-user library might have a better idea on which particular parts >>>>> of the memory guest may use for virtqueues and buffers, but I'm not >>>>> 100% sure. >>>> >>>> Yes, distinguishing hugepages of interest is a problem. >>>> >>>> Since v20.05, DPDK mem allocator takes care of excluding (unused) >>>> hugepages from dump. 
>>>> So with this OVS patch, if we catch private and shared hugepages,
>>>> "interesting" DPDK hugepages will get dumped, which is useful for
>>>> debugging post mortem.
>>>>
>>>> Adding Maxime, who will have a better idea of what is possible for the
>>>> guest mapping part.
>>>>
>>>>
>>>
>>> I wonder if we could do a MADV_DONTDUMP on all the guest memory at mmap
>>> time, then there are two cases:
>>> a. vIOMMU = OFF. In this case we could do MADV_DODUMP on virtqueues
>>> memory. Doing so, we would have the rings memory, but not their buffers
>>> (except if they are located on same hugepages).
>>> b. vIOMMU = ON. In this case we could do MADV_DODUMP on IOTLB_UPDATE
>>> new entries and MADV_DONTDUMP on invalidated entries. Doing so we will
>>> get both vrings and their buffers the backend is allowed to access.
>>
>> I guess, while DONTDUMP calls are mainly harmless, the explicit DODUMP
>> will override whatever user had in their global configuration. Meaning
>> every DPDK application with vhost ports will start dumping some of the
>> guest pages with no actual ability to turn that off.
>
> I initially thought it would work that way, but the DODUMP flag just
> disables the DONTDUMP flag.
>
> https://github.com/torvalds/linux/blob/master/mm/madvise.c#L1055
> https://github.com/torvalds/linux/blob/master/fs/coredump.c#L1033

Hmm, interesting. Makes sense.

Thanks for the pointers!

So, it should still be 7f regardless in the coredump filter for OVS, right?
Do you plan to update the current patch or do you think we should omit
shared pages until support for MADV_DO/DONTDUMP is added to vhost library?

Note that this will likely not be available in 22.11 as it's not a bug fix.
So, 23.11 at the earliest.

Basically 2 options:

1. 0x3f and not having shared pages. Flip to 0x7f with DPDK 23.11 next year.
   Pros: Smaller files.
   Cons: Missing some of the virtqueue memory until [potentially] DPDK 23.11.

2. 0x7f today.
   Pros: All the memory is available.
   Cons: [Significantly] larger files until [potentially] DPDK 23.11.

What do you think? David, Maxime?

>
> Cheers,
> M
>
>>
>> Can the behavior be configurable?
>>
>>>
>>> I can prepare a PoC quickly if someone is willing to experiment.
>>>
>>> Regards,
>>> Maxime
>>>
>>>
>>
>
On Fri, Dec 2, 2022 at 1:40 PM Ilya Maximets <i.maximets@ovn.org> wrote: > > On 12/2/22 18:59, Mike Pattrick wrote: > > On Fri, Dec 2, 2022 at 11:59 AM Ilya Maximets <i.maximets@ovn.org> wrote: > >> > >> On 12/2/22 11:36, Maxime Coquelin wrote: > >>> > >>> > >>> On 12/2/22 11:09, David Marchand wrote: > >>>> On Wed, Nov 30, 2022 at 9:30 PM Ilya Maximets <i.maximets@ovn.org> wrote: > >>>>>>>>> Shouldn't this be 0x7f instead? > >>>>>>>>> 0x3f doesn't enable bit #6, which is responsible for dumping > >>>>>>>>> shared huge pages. Or am I missing something? > >>>>>>>> > >>>>>>>> That's a good point, the hugepage may or may not be private. I'll send > >>>>>>>> in a new one. > >>>>>>> > >>>>>>> OK. One thing to think about though is that we'll grab > >>>>>>> VM memory, I guess, in case we have vhost-user ports. > >>>>>>> So, the core dump size can become insanely huge. > >>>>>>> > >>>>>>> The downside of not having them is inability to inspect > >>>>>>> virtqueues and stuff in the dump. > >>>>>> > >>>>>> Did you consider madvise()? > >>>>>> > >>>>>> MADV_DONTDUMP (since Linux 3.4) > >>>>>> Exclude from a core dump those pages in the range > >>>>>> specified by addr and length. This is useful in applications that > >>>>>> have large areas of memory that are known not to be useful in a core > >>>>>> dump. The effect of MADV_DONT‐ > >>>>>> DUMP takes precedence over the bit mask that is set via > >>>>>> the /proc/[pid]/coredump_filter file (see core(5)). > >>>>>> > >>>>>> MADV_DODUMP (since Linux 3.4) > >>>>>> Undo the effect of an earlier MADV_DONTDUMP. > >>>>> > >>>>> I don't think OVS actually knows location of particular VM memory > >>>>> pages that we do not need. And dumping virtqueues and stuff is, > >>>>> probably, the point of this patch (?). > >>>>> > >>>>> vhost-user library might have a better idea on which particular parts > >>>>> of the memory guest may use for virtqueues and buffers, but I'm not > >>>>> 100% sure. 
> >>>> > >>>> Yes, distinguishing hugepages of interest is a problem. > >>>> > >>>> Since v20.05, DPDK mem allocator takes care of excluding (unused) > >>>> hugepages from dump. > >>>> So with this OVS patch, if we catch private and shared hugepages, > >>>> "interesting" DPDK hugepages will get dumped, which is useful for > >>>> debugging post mortem. > >>>> > >>>> Adding Maxime, who will have a better idea of what is possible for the > >>>> guest mapping part. > >>>> > >>>> > >>> > >>> I wonder if we could do a MADV_DONTDUMP on all the guest memory at mmap > >>> time, then there are two cases: > >>> a. vIOMMU = OFF. In this case we could do MADV_DODUMP on virtqueues > >>> memory. Doing so, we would have the rings memory, but not their buffers > >>> (except if they are located on same hugepages). > >>> b. vIOMMU = ON. In this case we could do MADV_DODUMP on IOTLB_UPDATE > >>> new entries and MADV_DONTDUMP on invalidated entries. Doing so we will > >>> get both vrings and their buffers the backend is allowed to access. > >> > >> I guess, while DONTDUMP calls are mainly harmless, the explicit DODUMP > >> will override whatever user had in their global configuration. Meaning > >> every DPDK application with vhost ports will start dumping some of the > >> guest pages with no actual ability to turn that off. > > > > I initially thought it would work that way, but the DODUMP flag just > > disables the DONTDUMP flag. > > > > https://github.com/torvalds/linux/blob/master/mm/madvise.c#L1055 > > https://github.com/torvalds/linux/blob/master/fs/coredump.c#L1033 > > Hmm, interesting. Makes sense. > > Thanks for the pointers! > > So, it should still be 7f regardless in the coredump filter for OVS, right? > Do you plan to update the current patch or do you think we should omit > shared pages until support for MADV_DO/DONTDUMP is added to vhost library? > > Note that this will likely not be available in 22.11 as it's not a bug fix. > So, 23.11 at the earliest. 
> > Basically 2 options: > > 1. 0x3f and not having shared pages. Flip to 0x7f with DPDK 23.11 next year. > Pros: Smaller files > Cons: Missing some of the virtqueue memory until [potentially] DPDK 23.11. > > 2. 0x7f today. > Pros: All the memory is available. > Cons: [Significantly] larger files until [potentially] DPDK 23.11. > > What do you think? David, Maxime? I'd prefer 7f today. It's disabled by default, has zero impact on end users, makes setting up debugging environments more convenient, and on distributions with systemd the larger coredumps are managed somewhat automatically. The news item already warns about large coredumps. WDYT? -M > > > > Cheers, > > M > > > >> > >> Can the behavior be configurable? > >> > >>> > >>> I can prepare a PoC quickly if someone is willing to experiment. > >>> > >>> Regards, > >>> Maxime > >>> > >>> > >> > > >
On Fri, Dec 2, 2022 at 7:00 PM Mike Pattrick <mkp@redhat.com> wrote:
> > >>>> Did you consider madvise()?
> > >>>>
> > >>>> MADV_DONTDUMP (since Linux 3.4)
> > >>>>     Exclude from a core dump those pages in the range
> > >>>>     specified by addr and length. This is useful in applications that
> > >>>>     have large areas of memory that are known not to be useful in a core
> > >>>>     dump. The effect of MADV_DONTDUMP takes precedence over the bit mask
> > >>>>     that is set via the /proc/[pid]/coredump_filter file (see core(5)).
> > >>>>
> > >>>> MADV_DODUMP (since Linux 3.4)
> > >>>>     Undo the effect of an earlier MADV_DONTDUMP.
> > >>>
> > I guess, while DONTDUMP calls are mainly harmless, the explicit DODUMP
> > will override whatever user had in their global configuration. Meaning
> > every DPDK application with vhost ports will start dumping some of the
> > guest pages with no actual ability to turn that off.
>
> I initially thought it would work that way, but the DODUMP flag just
> disables the DONTDUMP flag.
>
> https://github.com/torvalds/linux/blob/master/mm/madvise.c#L1055
> https://github.com/torvalds/linux/blob/master/fs/coredump.c#L1033
>

Glad to read that the manual tells the same story as the kernel code :-).
On 12/2/22 21:14, Mike Pattrick wrote: > On Fri, Dec 2, 2022 at 1:40 PM Ilya Maximets <i.maximets@ovn.org> wrote: >> >> On 12/2/22 18:59, Mike Pattrick wrote: >>> On Fri, Dec 2, 2022 at 11:59 AM Ilya Maximets <i.maximets@ovn.org> wrote: >>>> >>>> On 12/2/22 11:36, Maxime Coquelin wrote: >>>>> >>>>> >>>>> On 12/2/22 11:09, David Marchand wrote: >>>>>> On Wed, Nov 30, 2022 at 9:30 PM Ilya Maximets <i.maximets@ovn.org> wrote: >>>>>>>>>>> Shouldn't this be 0x7f instead? >>>>>>>>>>> 0x3f doesn't enable bit #6, which is responsible for dumping >>>>>>>>>>> shared huge pages. Or am I missing something? >>>>>>>>>> >>>>>>>>>> That's a good point, the hugepage may or may not be private. I'll send >>>>>>>>>> in a new one. >>>>>>>>> >>>>>>>>> OK. One thing to think about though is that we'll grab >>>>>>>>> VM memory, I guess, in case we have vhost-user ports. >>>>>>>>> So, the core dump size can become insanely huge. >>>>>>>>> >>>>>>>>> The downside of not having them is inability to inspect >>>>>>>>> virtqueues and stuff in the dump. >>>>>>>> >>>>>>>> Did you consider madvise()? >>>>>>>> >>>>>>>> MADV_DONTDUMP (since Linux 3.4) >>>>>>>> Exclude from a core dump those pages in the range >>>>>>>> specified by addr and length. This is useful in applications that >>>>>>>> have large areas of memory that are known not to be useful in a core >>>>>>>> dump. The effect of MADV_DONT‐ >>>>>>>> DUMP takes precedence over the bit mask that is set via >>>>>>>> the /proc/[pid]/coredump_filter file (see core(5)). >>>>>>>> >>>>>>>> MADV_DODUMP (since Linux 3.4) >>>>>>>> Undo the effect of an earlier MADV_DONTDUMP. >>>>>>> >>>>>>> I don't think OVS actually knows location of particular VM memory >>>>>>> pages that we do not need. And dumping virtqueues and stuff is, >>>>>>> probably, the point of this patch (?). >>>>>>> >>>>>>> vhost-user library might have a better idea on which particular parts >>>>>>> of the memory guest may use for virtqueues and buffers, but I'm not >>>>>>> 100% sure. 
>>>>>> >>>>>> Yes, distinguishing hugepages of interest is a problem. >>>>>> >>>>>> Since v20.05, DPDK mem allocator takes care of excluding (unused) >>>>>> hugepages from dump. >>>>>> So with this OVS patch, if we catch private and shared hugepages, >>>>>> "interesting" DPDK hugepages will get dumped, which is useful for >>>>>> debugging post mortem. >>>>>> >>>>>> Adding Maxime, who will have a better idea of what is possible for the >>>>>> guest mapping part. >>>>>> >>>>>> >>>>> >>>>> I wonder if we could do a MADV_DONTDUMP on all the guest memory at mmap >>>>> time, then there are two cases: >>>>> a. vIOMMU = OFF. In this case we could do MADV_DODUMP on virtqueues >>>>> memory. Doing so, we would have the rings memory, but not their buffers >>>>> (except if they are located on same hugepages). >>>>> b. vIOMMU = ON. In this case we could do MADV_DODUMP on IOTLB_UPDATE >>>>> new entries and MADV_DONTDUMP on invalidated entries. Doing so we will >>>>> get both vrings and their buffers the backend is allowed to access. >>>> >>>> I guess, while DONTDUMP calls are mainly harmless, the explicit DODUMP >>>> will override whatever user had in their global configuration. Meaning >>>> every DPDK application with vhost ports will start dumping some of the >>>> guest pages with no actual ability to turn that off. >>> >>> I initially thought it would work that way, but the DODUMP flag just >>> disables the DONTDUMP flag. >>> >>> https://github.com/torvalds/linux/blob/master/mm/madvise.c#L1055 >>> https://github.com/torvalds/linux/blob/master/fs/coredump.c#L1033 >> >> Hmm, interesting. Makes sense. >> >> Thanks for the pointers! >> >> So, it should still be 7f regardless in the coredump filter for OVS, right? >> Do you plan to update the current patch or do you think we should omit >> shared pages until support for MADV_DO/DONTDUMP is added to vhost library? >> >> Note that this will likely not be available in 22.11 as it's not a bug fix. >> So, 23.11 at the earliest. 
>> >> Basically 2 options: >> >> 1. 0x3f and not having shared pages. Flip to 0x7f with DPDK 23.11 next year. >> Pros: Smaller files >> Cons: Missing some of the virtqueue memory until [potentially] DPDK 23.11. >> >> 2. 0x7f today. >> Pros: All the memory is available. >> Cons: [Significantly] larger files until [potentially] DPDK 23.11. >> >> What do you think? David, Maxime? > > I'd prefer 7f today. It's disabled by default, has zero impact on end > users, makes setting up debugging environments more convenient, and on > distributions with systemd the larger coredumps are managed somewhat > automatically. The news item already warns about large coredumps. > > WDYT? Sounds good to me. > > -M > >>> >>> Cheers, >>> M >>> >>>> >>>> Can the behavior be configurable? >>>> >>>>> >>>>> I can prepare a PoC quickly if someone is willing to experiment. >>>>> >>>>> Regards, >>>>> Maxime >>>>> >>>>> >>>> >>> >> >
On Fri, Dec 2, 2022 at 9:14 PM Mike Pattrick <mkp@redhat.com> wrote:
>
> On Fri, Dec 2, 2022 at 1:40 PM Ilya Maximets <i.maximets@ovn.org> wrote:
> >
> > On 12/2/22 18:59, Mike Pattrick wrote:
> > > On Fri, Dec 2, 2022 at 11:59 AM Ilya Maximets <i.maximets@ovn.org> wrote:
> > >>
> > >> On 12/2/22 11:36, Maxime Coquelin wrote:
> > >>>
> > >>> On 12/2/22 11:09, David Marchand wrote:
> > >>>> On Wed, Nov 30, 2022 at 9:30 PM Ilya Maximets <i.maximets@ovn.org> wrote:
> > >>>>>>>>> Shouldn't this be 0x7f instead?
> > >>>>>>>>> 0x3f doesn't enable bit #6, which is responsible for dumping
> > >>>>>>>>> shared huge pages.  Or am I missing something?
> > >>>>>>>>
> > >>>>>>>> That's a good point, the hugepage may or may not be private.  I'll
> > >>>>>>>> send in a new one.
> > >>>>>>>
> > >>>>>>> OK.  One thing to think about though is that we'll grab
> > >>>>>>> VM memory, I guess, in case we have vhost-user ports.
> > >>>>>>> So, the core dump size can become insanely huge.
> > >>>>>>>
> > >>>>>>> The downside of not having them is inability to inspect
> > >>>>>>> virtqueues and stuff in the dump.
> > >>>>>>
> > >>>>>> Did you consider madvise()?
> > >>>>>>
> > >>>>>> MADV_DONTDUMP (since Linux 3.4)
> > >>>>>>     Exclude from a core dump those pages in the range specified by
> > >>>>>>     addr and length.  This is useful in applications that have large
> > >>>>>>     areas of memory that are known not to be useful in a core dump.
> > >>>>>>     The effect of MADV_DONTDUMP takes precedence over the bit mask
> > >>>>>>     that is set via the /proc/[pid]/coredump_filter file (see core(5)).
> > >>>>>>
> > >>>>>> MADV_DODUMP (since Linux 3.4)
> > >>>>>>     Undo the effect of an earlier MADV_DONTDUMP.
> > >>>>>
> > >>>>> I don't think OVS actually knows location of particular VM memory
> > >>>>> pages that we do not need.  And dumping virtqueues and stuff is,
> > >>>>> probably, the point of this patch (?).
> > >>>>>
> > >>>>> vhost-user library might have a better idea on which particular parts
> > >>>>> of the memory guest may use for virtqueues and buffers, but I'm not
> > >>>>> 100% sure.
> > >>>>
> > >>>> Yes, distinguishing hugepages of interest is a problem.
> > >>>>
> > >>>> Since v20.05, DPDK mem allocator takes care of excluding (unused)
> > >>>> hugepages from dump.
> > >>>> So with this OVS patch, if we catch private and shared hugepages,
> > >>>> "interesting" DPDK hugepages will get dumped, which is useful for
> > >>>> debugging post mortem.
> > >>>>
> > >>>> Adding Maxime, who will have a better idea of what is possible for the
> > >>>> guest mapping part.
> > >>>
> > >>> I wonder if we could do a MADV_DONTDUMP on all the guest memory at mmap
> > >>> time, then there are two cases:
> > >>> a. vIOMMU = OFF.  In this case we could do MADV_DODUMP on virtqueues
> > >>> memory.  Doing so, we would have the rings memory, but not their buffers
> > >>> (except if they are located on same hugepages).
> > >>> b. vIOMMU = ON.  In this case we could do MADV_DODUMP on IOTLB_UPDATE
> > >>> new entries and MADV_DONTDUMP on invalidated entries.  Doing so we will
> > >>> get both vrings and their buffers the backend is allowed to access.
> > >>
> > >> I guess, while DONTDUMP calls are mainly harmless, the explicit DODUMP
> > >> will override whatever user had in their global configuration.  Meaning
> > >> every DPDK application with vhost ports will start dumping some of the
> > >> guest pages with no actual ability to turn that off.
> > >
> > > I initially thought it would work that way, but the DODUMP flag just
> > > disables the DONTDUMP flag.
> > >
> > > https://github.com/torvalds/linux/blob/master/mm/madvise.c#L1055
> > > https://github.com/torvalds/linux/blob/master/fs/coredump.c#L1033
> >
> > Hmm, interesting.  Makes sense.
> >
> > Thanks for the pointers!
> >
> > So, it should still be 7f regardless in the coredump filter for OVS, right?
> > Do you plan to update the current patch or do you think we should omit
> > shared pages until support for MADV_DO/DONTDUMP is added to vhost library?
> >
> > Note that this will likely not be available in 22.11 as it's not a bug fix.
> > So, 23.11 at the earliest.
> >
> > Basically 2 options:
> >
> > 1. 0x3f and not having shared pages.  Flip to 0x7f with DPDK 23.11 next year.
> >    Pros: Smaller files
> >    Cons: Missing some of the virtqueue memory until [potentially] DPDK 23.11.

Mm, if someone still has some --socket-mem config, then I guess shared
hugepages will be in use in DPDK.

> > 2. 0x7f today.
> >    Pros: All the memory is available.
> >    Cons: [Significantly] larger files until [potentially] DPDK 23.11.
> >
> > What do you think?  David, Maxime?
>
> I'd prefer 7f today. It's disabled by default, has zero impact on end
> users, makes setting up debugging environments more convenient, and on
> distributions with systemd the larger coredumps are managed somewhat
> automatically. The news item already warns about large coredumps.

I prefer the latter suggestion.
On 12/5/22 13:46, David Marchand wrote:
> On Fri, Dec 2, 2022 at 7:00 PM Mike Pattrick <mkp@redhat.com> wrote:
>>>>>>> Did you consider madvise()?
>>>>>>>
>>>>>>> MADV_DONTDUMP (since Linux 3.4)
>>>>>>>     Exclude from a core dump those pages in the range specified by
>>>>>>>     addr and length.  This is useful in applications that have large
>>>>>>>     areas of memory that are known not to be useful in a core dump.
>>>>>>>     The effect of MADV_DONTDUMP takes precedence over the bit mask
>>>>>>>     that is set via the /proc/[pid]/coredump_filter file (see core(5)).
>>>>>>>
>>>>>>> MADV_DODUMP (since Linux 3.4)
>>>>>>>     Undo the effect of an earlier MADV_DONTDUMP.
>>>>>>
>>> I guess, while DONTDUMP calls are mainly harmless, the explicit DODUMP
>>> will override whatever user had in their global configuration.  Meaning
>>> every DPDK application with vhost ports will start dumping some of the
>>> guest pages with no actual ability to turn that off.
>>
>> I initially thought it would work that way, but the DODUMP flag just
>> disables the DONTDUMP flag.
>>
>> https://github.com/torvalds/linux/blob/master/mm/madvise.c#L1055
>> https://github.com/torvalds/linux/blob/master/fs/coredump.c#L1033
>
> Glad to read that the manual tells the same story as the kernel code :-).

Manuals, pfff.  I seem to automatically skip them even if quoted
directly in the thread. :D
diff --git a/NEWS b/NEWS
index 8fa57836a..7af60dce3 100644
--- a/NEWS
+++ b/NEWS
@@ -3,6 +3,10 @@ Post-v2.17.0
    - OVSDB:
      * 'relay' service model now supports transaction history, i.e. honors the
        'last-txn-id' field in 'monitor_cond_since' requests from clients.
+   - ovs-ctl:
+     * New option '--dump-hugepages' to include hugepages in core dumps. This
+       can assist with postmortem analysis involving DPDK, but may also produce
+       significantly larger core dump files.
 
 v2.17.0 - 17 Feb 2022
diff --git a/utilities/ovs-ctl.in b/utilities/ovs-ctl.in
index e6e07f476..8f900314b 100644
--- a/utilities/ovs-ctl.in
+++ b/utilities/ovs-ctl.in
@@ -103,8 +103,13 @@ set_system_ids () {
     action "Configuring Open vSwitch system IDs" "$@" $extra_ids
 }
 
-check_force_cores () {
-    if test X"$FORCE_COREFILES" = Xyes; then
+check_core_config () {
+    if test X"$DUMP_HUGEPAGES" = Xyes; then
+        echo 0x3f > /proc/self/coredump_filter
+        if test X"$FORCE_COREFILES" = Xyes; then
+            ulimit -c unlimited
+        fi
+    elif test X"$FORCE_COREFILES" = Xyes; then
         ulimit -c 67108864
     fi
 }
@@ -116,7 +121,7 @@ del_transient_ports () {
 }
 
 do_start_ovsdb () {
-    check_force_cores
+    check_core_config
 
     if daemon_is_running ovsdb-server; then
         log_success_msg "ovsdb-server is already running"
@@ -193,7 +198,7 @@ add_managers () {
 }
 
 do_start_forwarding () {
-    check_force_cores
+    check_core_config
 
     insert_mod_if_required || return 1
 
@@ -330,6 +335,7 @@ set_defaults () {
 
     DAEMON_CWD=/
     FORCE_COREFILES=yes
+    DUMP_HUGEPAGES=no
     MLOCKALL=yes
     SELF_CONFINEMENT=yes
     MONITOR=yes
@@ -419,6 +425,7 @@ Other important options for "start", "restart" and "force-reload-kmod":
 Less important options for "start", "restart" and "force-reload-kmod":
   --daemon-cwd=DIR               set working dir for OVS daemons (default: $DAEMON_CWD)
   --no-force-corefiles           do not force on core dumps for OVS daemons
+  --dump-hugepages               include hugepages in coredumps
   --no-mlockall                  do not lock all of ovs-vswitchd into memory
   --ovsdb-server-priority=NICE   set ovsdb-server's niceness (default:
                                  $OVSDB_SERVER_PRIORITY)
   --ovsdb-server-options=OPTIONS additional options for ovsdb-server
                                  (example: '-vconsole:dbg -vfile:dbg')
Add a new option, --dump-hugepages, to ovs-ctl to enable the inclusion
of hugepages in the core dump filter.

Signed-off-by: Mike Pattrick <mkp@redhat.com>
---
 NEWS                 |  4 ++++
 utilities/ovs-ctl.in | 15 +++++++++++----
 2 files changed, 15 insertions(+), 4 deletions(-)