Message ID   | 20190405175626.4123-1-vladbu@mellanox.com
State        | Accepted
Delegated to | David Miller
Series       | [net-next] net: sched: flower: insert filter to ht before offloading it to hw
Fri, Apr 05, 2019 at 07:56:26PM CEST, vladbu@mellanox.com wrote:

>John reports:
>
>Recent refactoring of fl_change aims to use the classifier spinlock to
>avoid the need for rtnl lock. In doing so, the fl_hw_replace_filter()
>function was moved to before the lock is taken. This can create problems
>for drivers if duplicate filters are created (common in ovs tc offload
>due to filters being triggered by user-space matches).
>
>Drivers registered for such filters will now receive multiple copies of
>the same rule, each with a different cookie value. This means that the
>drivers would need to do a full match field lookup to determine
>duplicates, repeating work that will happen in flower __fl_lookup().
>Currently, drivers do not expect to receive duplicate filters.
>
>To fix this, verify that a filter with the same key is not present in the
>flower classifier hash table and insert the new filter into the flower
>hash table before offloading it to hardware. Implement helper function
>fl_ht_insert_unique() to atomically verify/insert a filter.
>
>This change makes the filter visible to the fast path at the beginning of
>the fl_change() function, which means it can no longer be freed directly
>in case of error. Refactor the fl_change() error handling code to
>deallocate the filter with an rcu timeout.
>
>Fixes: 620da4860827 ("net: sched: flower: refactor fl_change")
>Reported-by: John Hurley <john.hurley@netronome.com>
>Signed-off-by: Vlad Buslov <vladbu@mellanox.com>

Acked-by: Jiri Pirko <jiri@mellanox.com>
From: Vlad Buslov <vladbu@mellanox.com>
Date: Fri, 5 Apr 2019 20:56:26 +0300

> John reports:
> [...]
> Signed-off-by: Vlad Buslov <vladbu@mellanox.com>

Applied, thanks Vlad.
On Fri, 5 Apr 2019 20:56:26 +0300, Vlad Buslov wrote:
> John reports:
> [...]
> Signed-off-by: Vlad Buslov <vladbu@mellanox.com>

How is re-offload consistency guaranteed? IIUC the code is:

  insert into HT
  offload
  insert into IDR

What guarantees re-offload consistency if a new callback is added just
after the offload is requested but before the rule ends up in the IDR?
On Tue 09 Apr 2019 at 01:26, Jakub Kicinski <jakub.kicinski@netronome.com> wrote:
> On Fri, 5 Apr 2019 20:56:26 +0300, Vlad Buslov wrote:
>> John reports:
>> [...]
>> Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
>
> How is re-offload consistency guaranteed? IIUC the code is:
>
>   insert into HT
>   offload
>   insert into IDR
>
> What guarantees re-offload consistency if a new callback is added just
> after the offload is requested but before the rule ends up in the IDR?

Hi Jakub,

At the moment the cls hardware offloads API is always called with the
rtnl lock taken, so a rule can't be offloaded while reoffload is in
progress.
For my next patch set that unlocks the offloads API I implemented an
algorithm that tracks a reoffload count for each tp. It works like this:

1. struct tcf_proto is extended with a reoffload_count counter that is
   incremented each time reoffload is called on a particular tp
   instance. The counter is protected by tp->lock.

2. struct cls_fl_filter is also extended with a reoffload_count counter.
   Its value is set to the current tp->reoffload_count when offloading
   the filter.

3. After offloading the filter, but before inserting it into the idr,
   f->reoffload_count is compared with tp->reoffload_count. If the
   values don't match, the filter is deleted and -EAGAIN is returned.
   The cls API retries filter insertion on -EAGAIN.

Regards,
Vlad
On Tue, 9 Apr 2019 08:23:40 +0000, Vlad Buslov wrote:
> On Tue 09 Apr 2019 at 01:26, Jakub Kicinski <jakub.kicinski@netronome.com> wrote:
>> [...]
>>
>> What guarantees re-offload consistency if new callback is added just
>> after offload is requested but before rules ends up in IDR?
>
> Hi Jakub,
>
> At the moment cls hardware offloads API is always called with rtnl lock,
> so rule can't be offloaded while reoffload is in progress.

Does that somehow imply atomicity of offloading vs inserting into the IDR?
Doesn't seem so from a cursory look. Or do you mean rtnl_held is always
true?

> For my next patch set that unlocks the offloads API I implemented the
> algorithm to track reoffload count for each tp that works like this:
>
> 1. struct tcf_proto is extended with reoffload_count counter that
>    incremented each time reoffload is called on particular tp instance.
>    Counter is protected by tp->lock.
>
> 2. struct cls_fl_filter is also extended with reoffload_count counter.
>    Its value is set to current tp->reoffload_count when offloading the
>    filter.
>
> 3. After offloading the filter, but before inserting it to idr,
>    f->reoffload_count is compared with tp->reoffload_count. If values
>    don't match, filter is deleted and -EAGAIN is returned. Cls API
>    retries filter insertion on -EAGAIN.

Sounds good for add. Does this solve the delete case as well?

  CPU 0                      CPU 1

  __fl_delete
    IDR remove
                             cb unregister
                             hw delete all flows  <- doesn't see the
                                                     remove in progress
    hw delete  <- doesn't see the removed cb
On Tue 09 Apr 2019 at 20:10, Jakub Kicinski <jakub.kicinski@netronome.com> wrote:
> On Tue, 9 Apr 2019 08:23:40 +0000, Vlad Buslov wrote:
>> [...]
>>
>> At the moment cls hardware offloads API is always called with rtnl lock,
>> so rule can't be offloaded while reoffload is in progress.
>
> Does that somehow imply atomicity of offloading vs inserting into IDR?
> Doesn't seem so from a cursory look. Or do you mean rtnl_held is
> always true?

Sorry, I forgot that we are discussing a shared block, for which rtnl is
not taken in tc_new_tfilter(). Now I understand the issue and will send
my 'reoffload_count' implementation as a fix.

> Sounds good for add. Does this solve delete case as well?
>
>   CPU 0                      CPU 1
>
>   __fl_delete
>     IDR remove
>                              cb unregister
>                              hw delete all flows  <- doesn't see the
>                                                      remove in progress
>     hw delete  <- doesn't see the removed cb

Thanks for pointing that out! Looks like I need to move the call to hw
delete in the __fl_delete() function to be executed before the idr
removal.
On Wed, 10 Apr 2019 14:53:53 +0000, Vlad Buslov wrote:
> [...]
>
> Thanks for pointing that out! Looks like I need to move call to hw
> delete in __fl_delete() function to be executed before idr removal.

Ack, plus you need to do the same retry mechanism. Save CB "count"/seq,
hw delete, remove from IDR, if CB "count"/seq changed hw delete again.
Right?
On Wed 10 Apr 2019 at 18:48, Jakub Kicinski <jakub.kicinski@netronome.com> wrote:
> [...]
>
> Ack, plus you need to do the same retry mechanism. Save CB "count"/seq,
> hw delete, remove from IDR, if CB "count"/seq changed hw delete again.
> Right?

Actually, I intended to modify fl_reoffload() to ignore filters with the
'deleted' flag set when adding, but I guess reusing 'reoffload_count' to
retry fl_hw_destroy_filter() would also work.
On Wed, 10 Apr 2019 16:02:17 +0000, Vlad Buslov wrote:
> [...]
>
> Actually, I intended to modify fl_reoffload() to ignore filters with
> 'deleted' flag set when adding, but I guess reusing 'reoffload_count' to
> retry fl_hw_destroy_filter() would also work.

Yeah, I don't see how you can ignore deleted safely. Perhaps lack of
coffee :)
On Wed 10 Apr 2019 at 19:09, Jakub Kicinski <jakub.kicinski@netronome.com> wrote:
> [...]
>
> Yeah, I don't see how you can ignore deleted safely. Perhaps lack of
> coffee :)

Well, drivers are supposed to account for double deletion, or deletion of
filters that were never successfully offloaded to them. If a filter is
not marked as skip_sw, its creation will succeed even if the hw callbacks
have failed, but __fl_delete() still calls fl_hw_destroy_filter() on such
filters. The main thing is that we must guarantee that the code doesn't
delete a new filter with the same key. However, in case of the flower
classifier the 'cookie' is a pointer to the filter, and the filter is
freed only when the last reference to it is released, so the code is safe
in this regard.

So I guess there is nothing wrong with reoffload calling cb() on all
classifier filters (including those marked as 'deleted'), as long as the
delete code doesn't miss any of the callbacks afterwards.
On Wed, 10 Apr 2019 16:26:38 +0000, Vlad Buslov wrote:
> [...]
>
> So I guess there is nothing wrong with reoffload calling cb()
> on all classifier filters (including marked as 'deleted'), if delete
> code doesn't miss any of the callbacks afterwards.

Yup. Plus multiple/spurious deletes are already a fact of life, since we
don't keep track of which callback accepted the filter and which didn't.
On Fri, Apr 05, 2019 at 08:56:26PM +0300, Vlad Buslov wrote:
> John reports:
> [...]
> Signed-off-by: Vlad Buslov <vladbu@mellanox.com>

Vlad,

Our regression machines all hit a NULL pointer dereference [1] which I
bisected to this patch. Created this reproducer that you can use:

  ip netns add ns1
  ip -n ns1 link add dev dummy1 type dummy
  tc -n ns1 qdisc add dev dummy1 clsact
  tc -n ns1 filter add dev dummy1 ingress pref 1 proto ip \
      flower skip_hw src_ip 192.0.2.1 action drop
  ip netns del ns1

Can you please look into this?

Thanks

[1]
[    5.332176] BUG: unable to handle kernel NULL pointer dereference at 0000000000000004
[    5.334372] #PF error: [normal kernel read fault]
[    5.335619] PGD 0 P4D 0
[    5.336360] Oops: 0000 [#1] SMP
[    5.337249] CPU: 0 PID: 7 Comm: kworker/u2:0 Not tainted 5.1.0-rc4-custom-01473-g526bb57a6ad6 #1374
[    5.339232] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20180531_142017-buildhw-08.phx2.fedoraproject.org-1.fc28 04/01/2014
[    5.341982] Workqueue: netns cleanup_net
[    5.342843] RIP: 0010:__fl_put+0x24/0xb0
[    5.343808] Code: 84 00 00 00 00 00 3e ff 8f f0 03 00 00 0f 88 da 7b 14 00 74 01 c3 80 bf f4 03 00 00 00 0f 84 83 00 00 00 4c 8b 87 c8 01 00 00 <41> 8b 50 04 49 8d 70 04 85 d2 74 60 8d 4a 01 39 ca 7f 52 81 fa fe
[    5.348099] RSP: 0018:ffffabe300663be0 EFLAGS: 00010202
[    5.349223] RAX: ffff9ea4ba1aff00 RBX: ffff9ea4b99af400 RCX: ffffabe300663c67
[    5.350572] RDX: 00000000000004c5 RSI: 0000000000000000 RDI: ffff9ea4b99af400
[    5.351919] RBP: ffff9ea4ba28e900 R08: 0000000000000000 R09: ffffffff9d1b0075
[    5.353272] R10: ffffeb2884e66b80 R11: ffffffff9dc4dcd8 R12: ffff9ea4b99af408
[    5.354635] R13: ffff9ea4b99ae400 R14: ffff9ea4b9a47800 R15: ffff9ea4b99ae000
[    5.355988] FS:  0000000000000000(0000) GS:ffff9ea4bba00000(0000) knlGS:0000000000000000
[    5.357436] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    5.358530] CR2: 0000000000000004 CR3: 00000001398fa004 CR4: 0000000000160ef0
[    5.359876] Call Trace:
[    5.360360]  __fl_delete+0x223/0x3b0
[    5.361008]  fl_destroy+0xb4/0x130
[    5.361641]  tcf_proto_destroy+0x15/0x40
[    5.362429]  tcf_chain_flush+0x4e/0x60
[    5.363125]  __tcf_block_put+0xb4/0x150
[    5.363805]  clsact_destroy+0x30/0x40
[    5.364507]  qdisc_destroy+0x44/0x110
[    5.365218]  dev_shutdown+0x6e/0xa0
[    5.365821]  rollback_registered_many+0x25d/0x510
[    5.366724]  ? netdev_run_todo+0x221/0x280
[    5.367485]  unregister_netdevice_many+0x15/0xa0
[    5.368355]  default_device_exit_batch+0x13f/0x170
[    5.369268]  ? wait_woken+0x80/0x80
[    5.369910]  cleanup_net+0x19a/0x280
[    5.370558]  process_one_work+0x1f5/0x3f0
[    5.371326]  worker_thread+0x28/0x3c0
[    5.372038]  ? process_one_work+0x3f0/0x3f0
[    5.372755]  kthread+0x10d/0x130
[    5.373358]  ? __kthread_create_on_node+0x180/0x180
[    5.374298]  ret_from_fork+0x35/0x40
[    5.374934] CR2: 0000000000000004
[    5.375454] ---[ end trace c20e7f74127772e5 ]---
[    5.376284] RIP: 0010:__fl_put+0x24/0xb0
[    5.377003] Code: 84 00 00 00 00 00 3e ff 8f f0 03 00 00 0f 88 da 7b 14 00 74 01 c3 80 bf f4 03 00 00 00 0f 84 83 00 00 00 4c 8b 87 c8 01 00 00 <41> 8b 50 04 49 8d 70 04 85 d2 74 60 8d 4a 01 39 ca 7f 52 81 fa fe
[    5.380269] RSP: 0018:ffffabe300663be0 EFLAGS: 00010202
[    5.381237] RAX: ffff9ea4ba1aff00 RBX: ffff9ea4b99af400 RCX: ffffabe300663c67
[    5.382551] RDX: 00000000000004c5 RSI: 0000000000000000 RDI: ffff9ea4b99af400
[    5.383972] RBP: ffff9ea4ba28e900 R08: 0000000000000000 R09: ffffffff9d1b0075
[    5.385314] R10: ffffeb2884e66b80 R11: ffffffff9dc4dcd8 R12: ffff9ea4b99af408
[    5.386616] R13: ffff9ea4b99ae400 R14: ffff9ea4b9a47800 R15: ffff9ea4b99ae000
[    5.387986] FS:  0000000000000000(0000) GS:ffff9ea4bba00000(0000) knlGS:0000000000000000
[    5.389512] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    5.390546] CR2: 0000000000000004 CR3: 00000001398fa004 CR4: 0000000000160ef0
On Thu 11 Apr 2019 at 14:13, Ido Schimmel <idosch@idosch.org> wrote:
> [...]
>
> Vlad,
>
> Our regression machines all hit a NULL pointer dereference [1] which I
> bisected to this patch. Created this reproducer that you can use:
>
>   ip netns add ns1
>   ip -n ns1 link add dev dummy1 type dummy
>   tc -n ns1 qdisc add dev dummy1 clsact
>   tc -n ns1 filter add dev dummy1 ingress pref 1 proto ip \
>       flower skip_hw src_ip 192.0.2.1 action drop
>   ip netns del ns1
>
> Can you please look into this?
>
> Thanks

Will do.
diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c
index 6050e3caee31..2763176e369c 100644
--- a/net/sched/cls_flower.c
+++ b/net/sched/cls_flower.c
@@ -1459,6 +1459,28 @@ static int fl_set_parms(struct net *net, struct tcf_proto *tp,
 	return 0;
 }
 
+static int fl_ht_insert_unique(struct cls_fl_filter *fnew,
+			       struct cls_fl_filter *fold,
+			       bool *in_ht)
+{
+	struct fl_flow_mask *mask = fnew->mask;
+	int err;
+
+	err = rhashtable_insert_fast(&mask->ht,
+				     &fnew->ht_node,
+				     mask->filter_ht_params);
+	if (err) {
+		*in_ht = false;
+		/* It is okay if filter with same key exists when
+		 * overwriting.
+		 */
+		return fold && err == -EEXIST ? 0 : err;
+	}
+
+	*in_ht = true;
+	return 0;
+}
+
 static int fl_change(struct net *net, struct sk_buff *in_skb,
 		     struct tcf_proto *tp, unsigned long base,
 		     u32 handle, struct nlattr **tca,
@@ -1470,6 +1492,7 @@ static int fl_change(struct net *net, struct sk_buff *in_skb,
 	struct cls_fl_filter *fnew;
 	struct fl_flow_mask *mask;
 	struct nlattr **tb;
+	bool in_ht;
 	int err;
 
 	if (!tca[TCA_OPTIONS]) {
@@ -1528,10 +1551,14 @@ static int fl_change(struct net *net, struct sk_buff *in_skb,
 	if (err)
 		goto errout;
 
+	err = fl_ht_insert_unique(fnew, fold, &in_ht);
+	if (err)
+		goto errout_mask;
+
 	if (!tc_skip_hw(fnew->flags)) {
 		err = fl_hw_replace_filter(tp, fnew, rtnl_held, extack);
 		if (err)
-			goto errout_mask;
+			goto errout_ht;
 	}
 
 	if (!tc_in_hw(fnew->flags))
@@ -1557,10 +1584,17 @@ static int fl_change(struct net *net, struct sk_buff *in_skb,
 
 		fnew->handle = handle;
 
-		err = rhashtable_insert_fast(&fnew->mask->ht, &fnew->ht_node,
-					     fnew->mask->filter_ht_params);
-		if (err)
-			goto errout_hw;
+		if (!in_ht) {
+			struct rhashtable_params params =
+				fnew->mask->filter_ht_params;
+
+			err = rhashtable_insert_fast(&fnew->mask->ht,
+						     &fnew->ht_node,
+						     params);
+			if (err)
+				goto errout_hw;
+			in_ht = true;
+		}
 
 		rhashtable_remove_fast(&fold->mask->ht,
 				       &fold->ht_node,
@@ -1582,11 +1616,6 @@ static int fl_change(struct net *net, struct sk_buff *in_skb,
 		refcount_dec(&fold->refcnt);
 		__fl_put(fold);
 	} else {
-		if (__fl_lookup(fnew->mask, &fnew->mkey)) {
-			err = -EEXIST;
-			goto errout_hw;
-		}
-
 		if (handle) {
 			/* user specifies a handle and it doesn't exist */
 			err = idr_alloc_u32(&head->handle_idr, fnew, &handle,
@@ -1609,12 +1638,6 @@ static int fl_change(struct net *net, struct sk_buff *in_skb,
 			goto errout_hw;
 
 		fnew->handle = handle;
-
-		err = rhashtable_insert_fast(&fnew->mask->ht, &fnew->ht_node,
-					     fnew->mask->filter_ht_params);
-		if (err)
-			goto errout_idr;
-
 		list_add_tail_rcu(&fnew->list, &fnew->mask->filters);
 		spin_unlock(&tp->lock);
 	}
@@ -1625,17 +1648,18 @@ static int fl_change(struct net *net, struct sk_buff *in_skb,
 	kfree(mask);
 	return 0;
 
-errout_idr:
-	idr_remove(&head->handle_idr, fnew->handle);
 errout_hw:
 	spin_unlock(&tp->lock);
 	if (!tc_skip_hw(fnew->flags))
 		fl_hw_destroy_filter(tp, fnew, rtnl_held, NULL);
+errout_ht:
+	if (in_ht)
+		rhashtable_remove_fast(&fnew->mask->ht, &fnew->ht_node,
+				       fnew->mask->filter_ht_params);
 errout_mask:
 	fl_mask_put(head, fnew->mask, true);
 errout:
-	tcf_exts_destroy(&fnew->exts);
-	kfree(fnew);
+	tcf_queue_work(&fnew->rwork, fl_destroy_filter_work);
 errout_tb:
 	kfree(tb);
 errout_mask_alloc:
John reports:

Recent refactoring of fl_change aims to use the classifier spinlock to
avoid the need for rtnl lock. In doing so, the fl_hw_replace_filter()
function was moved to before the lock is taken. This can create problems
for drivers if duplicate filters are created (common in ovs tc offload
due to filters being triggered by user-space matches).

Drivers registered for such filters will now receive multiple copies of
the same rule, each with a different cookie value. This means that the
drivers would need to do a full match field lookup to determine
duplicates, repeating work that will happen in flower __fl_lookup().
Currently, drivers do not expect to receive duplicate filters.

To fix this, verify that a filter with the same key is not present in the
flower classifier hash table and insert the new filter into the flower
hash table before offloading it to hardware. Implement helper function
fl_ht_insert_unique() to atomically verify/insert a filter.

This change makes the filter visible to the fast path at the beginning of
the fl_change() function, which means it can no longer be freed directly
in case of error. Refactor the fl_change() error handling code to
deallocate the filter with an rcu timeout.

Fixes: 620da4860827 ("net: sched: flower: refactor fl_change")
Reported-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
---
 net/sched/cls_flower.c | 64 +++++++++++++++++++++++++++++-------------
 1 file changed, 44 insertions(+), 20 deletions(-)