[ovs-dev] datapath: Prevent panic

Message ID 1523901533-3510-1-git-send-email-gvrose8192@gmail.com
State Superseded

Commit Message

Gregory Rose April 16, 2018, 5:58 p.m. UTC
On RHEL 7.x kernels we observe a panic caused by a paging error:
a timer kicks off a job that then accesses memory that belonged
to the openvswitch kernel module after the module has been
unloaded.

The panic can be induced on any RHEL 7.x kernel with the following test:

while true
do
    make check-kmod TESTSUITEFLAGS="-k \!gre"
done

On the systems I've been testing on it generally takes anywhere from a
minute to 15 minutes or so to repro but never longer than that.  Similar
results have been seen by other testers.
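
Because the reproduction window above is bounded (a minute to roughly
15 minutes), a time-limited variant of the loop can be handy for
unattended runs. The following is only a sketch: `repro_loop` and its
budget argument are hypothetical names introduced here, not part of the
OVS tree, and it assumes a `date` that supports `+%s`. A run that
reaches the end of the budget without the host going down is the
signal of interest.

```shell
#!/bin/sh
# Hypothetical helper (not part of OVS): run a command repeatedly until
# it fails or a wall-clock budget, given in seconds, is used up.
# Usage: repro_loop BUDGET_SECONDS COMMAND [ARGS...]
repro_loop() {
    budget=$1
    shift
    start=$(date +%s)
    runs=0
    # Keep going while the elapsed time is below the budget.
    while [ $(( $(date +%s) - start )) -lt "$budget" ]; do
        runs=$((runs + 1))
        if ! "$@"; then
            echo "run $runs failed"
            return 1
        fi
    done
    echo "no failure in $runs runs"
    return 0
}
```

For example, `repro_loop 1800 make check-kmod TESTSUITEFLAGS="-k \!gre"`
would repeat the test from the loop above for about 30 minutes, twice
the longest reproduction time reported here.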

This patch does not fix the underlying bug, which still needs to be
investigated and fixed, but it does prevent the panic from occurring.
We would like to keep customer systems from panicking while we do
further investigation to find the root cause.

Signed-off-by: Greg Rose <gvrose8192@gmail.com>
---
 datapath/datapath.c         | 10 ++++++++++
 tests/system-kmod-macros.at |  1 +
 utilities/ovs-lib.in        |  1 +
 3 files changed, 12 insertions(+)

Comments

Pravin Shelar April 17, 2018, 6:32 a.m. UTC | #1
On Mon, Apr 16, 2018 at 10:58 AM, Greg Rose <gvrose8192@gmail.com> wrote:
> On RHEL 7.x kernels we observe a panic induced by a paging error
> when the timer kicks off a job that subsequently accesses memory
> that belonged to the openvswitch kernel module but was since
> unloaded - thus the paging error.
>
> The panic can be induced on any RHEL 7.x kernel with the following test:
>
> while `true`
> do
>     make check-kmod TESTSUITEFLAGS="-k \!gre"
> done
>
> On the systems I've been testing on it generally takes anywhere from a
> minute to 15 minutes or so to repro but never longer than that.  Similar
> results have been seen by other testers.
>
> This patch does not fix the underlying bug, which does need to be
> investigated and fixed, but it does prevent it from occurring. We
> would like to prevent customer systems from panicking while we do
> further investigation to find the root cause.
>
Can you add the stack trace to the commit message?
Gregory Rose April 17, 2018, 7:03 p.m. UTC | #2
On 4/16/2018 11:32 PM, Pravin Shelar wrote:
> On Mon, Apr 16, 2018 at 10:58 AM, Greg Rose <gvrose8192@gmail.com> wrote:
>> On RHEL 7.x kernels we observe a panic induced by a paging error
>> when the timer kicks off a job that subsequently accesses memory
>> that belonged to the openvswitch kernel module but was since
>> unloaded - thus the paging error.
>>
>> The panic can be induced on any RHEL 7.x kernel with the following test:
>>
>> while `true`
>> do
>>      make check-kmod TESTSUITEFLAGS="-k \!gre"
>> done
>>
>> On the systems I've been testing on it generally takes anywhere from a
>> minute to 15 minutes or so to repro but never longer than that.  Similar
>> results have been seen by other testers.
>>
>> This patch does not fix the underlying bug, which does need to be
>> investigated and fixed, but it does prevent it from occurring. We
>> would like to prevent customer systems from panicking while we do
>> further investigation to find the root cause.
>>
> Can you add the stack trace to the commit message?

Sure, I'll send a V2 in a bit.

Thanks,

- Greg
Patch

diff --git a/datapath/datapath.c b/datapath/datapath.c
index 3ea240a..43f0d74 100644
--- a/datapath/datapath.c
+++ b/datapath/datapath.c
@@ -2478,6 +2478,16 @@  error:
 
 static void dp_cleanup(void)
 {
+#if RHEL_RELEASE_CODE < RHEL_RELEASE_VERSION(8,0)
+	/* On RHEL 7.x kernels we hit a kernel paging error without
+	 * this barrier and subsequent hefty delay.  A process will
+	 * attempt to access openvswitch memory after it has been
+	 * unloaded.  Further debugging is needed on that but for
+	 * now let's not let customer machines panic.
+	 */
+	rcu_barrier();
+	msleep(3000);
+#endif
 	dp_unregister_genl(ARRAY_SIZE(dp_genl_families));
 	ovs_netdev_exit();
 	unregister_netdevice_notifier(&ovs_dp_device_notifier);
diff --git a/tests/system-kmod-macros.at b/tests/system-kmod-macros.at
index f23a406..2b9b691 100644
--- a/tests/system-kmod-macros.at
+++ b/tests/system-kmod-macros.at
@@ -23,6 +23,7 @@  m4_define([OVS_TRAFFIC_VSWITCHD_START],
                on_exit 'modprobe -q -r mod'
               ])
    on_exit 'ovs-dpctl del-dp ovs-system'
+   on_exit 'ovs-appctl dpctl/flush-conntrack'
    _OVS_VSWITCHD_START([])
    dnl Add bridges, ports, etc.
    AT_CHECK([ovs-vsctl -- _ADD_BR([br0]) -- $1 m4_if([$2], [], [], [| uuidfilt])], [0], [$2])
diff --git a/utilities/ovs-lib.in b/utilities/ovs-lib.in
index 4dc3151..4c3ad0f 100644
--- a/utilities/ovs-lib.in
+++ b/utilities/ovs-lib.in
@@ -616,6 +616,7 @@  force_reload_kmod () {
     for dp in `ovs-dpctl dump-dps`; do
         action "Removing datapath: $dp" ovs-dpctl del-dp "$dp"
     done
+    action "Flushing conntrack entries" ovs-appctl dpctl/flush-conntrack
 
     for vport in `awk '/^vport_/ { print $1 }' /proc/modules`; do
         action "Removing $vport module" rmmod $vport
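
The ovs-lib hunk above leans on the calling convention visible in its
context line, `action "Removing datapath: $dp" ovs-dpctl del-dp "$dp"`:
the first argument is a description and the remaining words are the
command to run. A minimal sketch of that convention follows. The
function body is a simplified stand-in written for illustration, not
the ovs-lib implementation, and the "nothing to run" message is an
assumption of this sketch.

```shell
#!/bin/sh
# Simplified stand-in for an `action`-style helper. This is NOT the
# ovs-lib implementation; it only illustrates the assumed convention:
#   $1      human-readable description
#   $2...   command and arguments to execute
action() {
    desc=$1
    shift
    printf '%s ... ' "$desc"
    # With only a description there is no command left to execute.
    if [ $# -eq 0 ]; then
        echo "nothing to run"
        return 0
    fi
    if "$@"; then
        echo ok
    else
        echo failed
        return 1
    fi
}
```

Under this convention the command has to be passed as separate words
after the description. A call that passes a single quoted string
supplies a description only, so no command is executed.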