Message ID | 60e827cb-2bba-2b7e-55dc-651103e9905f@huawei.com |
---|---|
State | Changes Requested |
Delegated to: | David Miller |
Series | vrf: Fix possible NULL pointer oops when delete nic |
On 11/14/19 11:22 PM, wangxiaogang (F) wrote:
> From: XiaoGang Wang <wangxiaogang3@huawei.com>
>
> Recently we get a crash when access illegal address (0xc0),
> which will occasionally appear when deleting a physical NIC with vrf.
>

How long have you been running this test?

I am wondering if this is fallout from the recent adjacency changes in
commits 5343da4c1742 through f3b0a18bb6cb.

On 11/14/19 11:22 PM, wangxiaogang (F) wrote:
> diff --git a/drivers/net/vrf.c b/drivers/net/vrf.c
> index b8228f5..86c4b8c 100644
> --- a/drivers/net/vrf.c
> +++ b/drivers/net/vrf.c
> @@ -1427,6 +1427,9 @@ static int vrf_device_event(struct notifier_block *unused,
>  			goto out;
>
>  		vrf_dev = netdev_master_upper_dev_get(dev);
> +		if (!vrf_dev)
> +			goto out;
> +
>  		vrf_del_slave(vrf_dev, dev);
>  	}
>  out:

BTW, I believe this is the wrong fix. A device can not be a VRF slave
AND not have an upper device. Something is fundamentally wrong.

From: "wangxiaogang (F)" <wangxiaogang3@huawei.com>
Date: Fri, 15 Nov 2019 14:22:56 +0800

> From: XiaoGang Wang <wangxiaogang3@huawei.com>
>
> Recently we get a crash when access illegal address (0xc0),
> which will occasionally appear when deleting a physical NIC with vrf.
>
> [166603.826737]hinic 0000:43:00.4 eth-s3: Failed to cycle device eth-s3;
> route tables might be wrong!
> .....
> [166603.828018]WARNING: CPU: 135 PID: 15382at net/core/dev.c:6875
> __netdev_adjacent_dev_remove.constprop.40+0x1e0/0x1e8
> ......

Taehee-ssi, please take a look at this.

It is believed that this may be caused by the adjacency fixes you made
recently.

Thank you.

On Sun, 17 Nov 2019 at 05:53, David Miller <davem@davemloft.net> wrote:
>

Hi David,
Thank you for Ccing!

> From: "wangxiaogang (F)" <wangxiaogang3@huawei.com>
> Date: Fri, 15 Nov 2019 14:22:56 +0800
>
> > From: XiaoGang Wang <wangxiaogang3@huawei.com>
> >
> > Recently we get a crash when access illegal address (0xc0),
> > which will occasionally appear when deleting a physical NIC with vrf.
> >
> > [166603.826737]hinic 0000:43:00.4 eth-s3: Failed to cycle device eth-s3;
> > route tables might be wrong!
> > .....
> > [166603.828018]WARNING: CPU: 135 PID: 15382at net/core/dev.c:6875
> > __netdev_adjacent_dev_remove.constprop.40+0x1e0/0x1e8
> > ......
>
> Taehee-ssi, please take a look at this.
>
> It is believed that this may be caused by the adjacency fixes you made
> recently.
>

I will take a look at this
Thank you!

> Thank you.

On 2019/11/15 21:14, David Ahern wrote:
> On 11/14/19 11:22 PM, wangxiaogang (F) wrote:
>> From: XiaoGang Wang <wangxiaogang3@huawei.com>
>>
>> Recently we get a crash when access illegal address (0xc0),
>> which will occasionally appear when deleting a physical NIC with vrf.
>>
>
> How long have you been running this test?
>
> I am wondering if this is fallout from the recent adjacency changes in
> commits 5343da4c1742 through f3b0a18bb6cb.
>

Thank you so much for the reply. Our kernel version is Linux 4.19.
This problem happened once in our production environment.

On 2019/11/16 0:59, David Ahern wrote:
> On 11/14/19 11:22 PM, wangxiaogang (F) wrote:
>> diff --git a/drivers/net/vrf.c b/drivers/net/vrf.c
>> index b8228f5..86c4b8c 100644
>> --- a/drivers/net/vrf.c
>> +++ b/drivers/net/vrf.c
>> @@ -1427,6 +1427,9 @@ static int vrf_device_event(struct notifier_block *unused,
>>  			goto out;
>>
>>  		vrf_dev = netdev_master_upper_dev_get(dev);
>> +		if (!vrf_dev)
>> +			goto out;
>> +
>>  		vrf_del_slave(vrf_dev, dev);
>>  	}
>>  out:
>
> BTW, I believe this is the wrong fix. A device can not be a VRF slave
> AND not have an upper device. Something is fundamentally wrong.
>

This problem occurred when our testers deleted the NIC and the VRF in parallel.
I will try to reproduce it later.

On 11/17/19 8:16 PM, wangxiaogang (F) wrote:
>
>
> On 2019/11/16 0:59, David Ahern wrote:
>> On 11/14/19 11:22 PM, wangxiaogang (F) wrote:
>>> diff --git a/drivers/net/vrf.c b/drivers/net/vrf.c
>>> index b8228f5..86c4b8c 100644
>>> --- a/drivers/net/vrf.c
>>> +++ b/drivers/net/vrf.c
>>> @@ -1427,6 +1427,9 @@ static int vrf_device_event(struct notifier_block *unused,
>>>  			goto out;
>>>
>>>  		vrf_dev = netdev_master_upper_dev_get(dev);
>>> +		if (!vrf_dev)
>>> +			goto out;
>>> +
>>>  		vrf_del_slave(vrf_dev, dev);
>>>  	}
>>>  out:
>>
>> BTW, I believe this is the wrong fix. A device can not be a VRF slave
>> AND not have an upper device. Something is fundamentally wrong.
>>
>>
>
> this problem occurs when our testers deleted the NIC and vrf in parallel.
> I will try to recurring this problem later.
>

The deletes are serial in the kernel due to the rtnl, but dev changes
are under rcu...

On 11/17/19 8:15 PM, wangxiaogang (F) wrote:
>
>
> On 2019/11/15 21:14, David Ahern wrote:
>> On 11/14/19 11:22 PM, wangxiaogang (F) wrote:
>>> From: XiaoGang Wang <wangxiaogang3@huawei.com>
>>>
>>> Recently we get a crash when access illegal address (0xc0),
>>> which will occasionally appear when deleting a physical NIC with vrf.
>>>
>>
>> How long have you been running this test?
>>
>> I am wondering if this is fallout from the recent adjacency changes in
>> commits 5343da4c1742 through f3b0a18bb6cb.
>>
>>
> Thank you so much for the reply, our kernel version is linux 4.19.
> this problem happened once in our production environment.
>

ok, so the recent adjacency changes would not be at fault here.

diff --git a/drivers/net/vrf.c b/drivers/net/vrf.c
index b8228f5..86c4b8c 100644
--- a/drivers/net/vrf.c
+++ b/drivers/net/vrf.c
@@ -1427,6 +1427,9 @@ static int vrf_device_event(struct notifier_block *unused,
 			goto out;
 
 		vrf_dev = netdev_master_upper_dev_get(dev);
+		if (!vrf_dev)
+			goto out;
+
 		vrf_del_slave(vrf_dev, dev);
 	}
 out:
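
For readers following the discussion, the control flow that the patch changes can be modeled in user space. The sketch below is a toy, not kernel code: `struct toy_netdev`, `toy_del_slave()`, `toy_handle_unregister()` and the result enum are hypothetical names invented for illustration. The `is_l3_slave` flag is an assumption standing in for whatever slave check precedes the `goto out;` context line, and the `master` pointer stands in for the value returned by `netdev_master_upper_dev_get(dev)`. It only shows where the proposed NULL guard sits. It says nothing about how a device could end up marked as a slave without an upper device, which is the question raised in the review.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for struct net_device; NOT the kernel structure. */
struct toy_netdev {
	const char *name;
	bool is_l3_slave;          /* assumed stand-in for the slave check */
	struct toy_netdev *master; /* models netdev_master_upper_dev_get(dev) */
	int slave_count;           /* only meaningful on a master device */
};

enum toy_result {
	TOY_NOT_SLAVE, /* device is not enslaved: nothing to do */
	TOY_NO_MASTER, /* slave flag set but no master: the guarded case */
	TOY_RELEASED,  /* slave detached from its master */
};

/* Models vrf_del_slave(): dereferences master, so master must be valid. */
static void toy_del_slave(struct toy_netdev *master, struct toy_netdev *dev)
{
	assert(master != NULL && dev != NULL);
	master->slave_count--;
	dev->master = NULL;
	dev->is_l3_slave = false;
}

/* Models the unregister branch of the notifier with the proposed guard. */
enum toy_result toy_handle_unregister(struct toy_netdev *dev)
{
	struct toy_netdev *master;

	if (!dev->is_l3_slave)
		return TOY_NOT_SLAVE;

	master = dev->master;
	if (!master)
		return TOY_NO_MASTER; /* the check the patch adds */

	toy_del_slave(master, dev);
	return TOY_RELEASED;
}
```

In this model, a device with `is_l3_slave` set and a NULL `master` returns `TOY_NO_MASTER` and is left untouched, still flagged as a slave. That mirrors the reviewer's objection: the guard avoids the dereference but leaves the inconsistent state in place.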