[net-next] net: sched: flower: insert filter to ht before offloading it to hw

Message ID 20190405175626.4123-1-vladbu@mellanox.com
State Accepted
Delegated to: David Miller
Series [net-next] net: sched: flower: insert filter to ht before offloading it to hw

Commit Message

Vlad Buslov April 5, 2019, 5:56 p.m. UTC
John reports:

Recent refactoring of fl_change aims to use the classifier spinlock to
avoid the need for the rtnl lock. In doing so, the fl_hw_replace_filter()
function was moved to before the lock is taken. This can create problems
for drivers if duplicate filters are created (common in ovs tc offload
due to filters being triggered by user-space matches).

Drivers registered for such filters will now receive multiple copies of
the same rule, each with a different cookie value. This means that the
drivers would need to do a full match field lookup to determine
duplicates, repeating work that already happens in flower __fl_lookup().
Currently, drivers do not expect to receive duplicate filters.

To fix this, verify that a filter with the same key is not present in the
flower classifier hash table, and insert the new filter into the hash
table before offloading it to hardware. Implement the helper function
fl_ht_insert_unique() to atomically verify/insert a filter.

This change makes the filter visible to the fast path at the beginning of
the fl_change() function, which means it can no longer be freed directly
in case of error. Refactor the fl_change() error handling code to
deallocate the filter after an RCU grace period.
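
In outline, the resulting fl_change() ordering is (a simplified sketch of
the diff below, not the literal code):

    err = fl_ht_insert_unique(fnew, fold, &in_ht); /* atomic verify/insert */
    if (err)
        goto errout_mask;

    if (!tc_skip_hw(fnew->flags)) {
        err = fl_hw_replace_filter(tp, fnew, rtnl_held, extack);
        if (err)
            goto errout_ht; /* undo the hash table insert */
    }

    /* ... then take tp->lock and insert the filter into the handle idr.
     * On any error past this point the filter may already be visible to
     * readers, so it is freed with tcf_queue_work() after an RCU grace
     * period instead of kfree().
     */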

Fixes: 620da4860827 ("net: sched: flower: refactor fl_change")
Reported-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
---
 net/sched/cls_flower.c | 64 +++++++++++++++++++++++++++++-------------
 1 file changed, 44 insertions(+), 20 deletions(-)

Comments

Jiri Pirko April 6, 2019, 5:59 a.m. UTC | #1
Fri, Apr 05, 2019 at 07:56:26PM CEST, vladbu@mellanox.com wrote:
>John reports:
>
>Recent refactoring of fl_change aims to use the classifier spinlock to
>avoid the need for the rtnl lock. In doing so, the fl_hw_replace_filter()
>function was moved to before the lock is taken. This can create problems
>for drivers if duplicate filters are created (common in ovs tc offload
>due to filters being triggered by user-space matches).
>
>Drivers registered for such filters will now receive multiple copies of
>the same rule, each with a different cookie value. This means that the
>drivers would need to do a full match field lookup to determine
>duplicates, repeating work that already happens in flower __fl_lookup().
>Currently, drivers do not expect to receive duplicate filters.
>
>To fix this, verify that a filter with the same key is not present in the
>flower classifier hash table, and insert the new filter into the hash
>table before offloading it to hardware. Implement the helper function
>fl_ht_insert_unique() to atomically verify/insert a filter.
>
>This change makes the filter visible to the fast path at the beginning of
>the fl_change() function, which means it can no longer be freed directly
>in case of error. Refactor the fl_change() error handling code to
>deallocate the filter after an RCU grace period.
>
>Fixes: 620da4860827 ("net: sched: flower: refactor fl_change")
>Reported-by: John Hurley <john.hurley@netronome.com>
>Signed-off-by: Vlad Buslov <vladbu@mellanox.com>

Acked-by: Jiri Pirko <jiri@mellanox.com>
David Miller April 8, 2019, 2:34 a.m. UTC | #2
From: Vlad Buslov <vladbu@mellanox.com>
Date: Fri,  5 Apr 2019 20:56:26 +0300

> John reports:
> 
> Recent refactoring of fl_change aims to use the classifier spinlock to
> avoid the need for the rtnl lock. In doing so, the fl_hw_replace_filter()
> function was moved to before the lock is taken. This can create problems
> for drivers if duplicate filters are created (common in ovs tc offload
> due to filters being triggered by user-space matches).
> 
> Drivers registered for such filters will now receive multiple copies of
> the same rule, each with a different cookie value. This means that the
> drivers would need to do a full match field lookup to determine
> duplicates, repeating work that already happens in flower __fl_lookup().
> Currently, drivers do not expect to receive duplicate filters.
> 
> To fix this, verify that a filter with the same key is not present in the
> flower classifier hash table, and insert the new filter into the hash
> table before offloading it to hardware. Implement the helper function
> fl_ht_insert_unique() to atomically verify/insert a filter.
> 
> This change makes the filter visible to the fast path at the beginning of
> the fl_change() function, which means it can no longer be freed directly
> in case of error. Refactor the fl_change() error handling code to
> deallocate the filter after an RCU grace period.
> 
> Fixes: 620da4860827 ("net: sched: flower: refactor fl_change")
> Reported-by: John Hurley <john.hurley@netronome.com>
> Signed-off-by: Vlad Buslov <vladbu@mellanox.com>

Applied, thanks Vlad.
Jakub Kicinski April 8, 2019, 10:26 p.m. UTC | #3
On Fri,  5 Apr 2019 20:56:26 +0300, Vlad Buslov wrote:
> John reports:
> 
> Recent refactoring of fl_change aims to use the classifier spinlock to
> avoid the need for the rtnl lock. In doing so, the fl_hw_replace_filter()
> function was moved to before the lock is taken. This can create problems
> for drivers if duplicate filters are created (common in ovs tc offload
> due to filters being triggered by user-space matches).
> 
> Drivers registered for such filters will now receive multiple copies of
> the same rule, each with a different cookie value. This means that the
> drivers would need to do a full match field lookup to determine
> duplicates, repeating work that already happens in flower __fl_lookup().
> Currently, drivers do not expect to receive duplicate filters.
> 
> To fix this, verify that a filter with the same key is not present in the
> flower classifier hash table, and insert the new filter into the hash
> table before offloading it to hardware. Implement the helper function
> fl_ht_insert_unique() to atomically verify/insert a filter.
> 
> This change makes the filter visible to the fast path at the beginning of
> the fl_change() function, which means it can no longer be freed directly
> in case of error. Refactor the fl_change() error handling code to
> deallocate the filter after an RCU grace period.
> 
> Fixes: 620da4860827 ("net: sched: flower: refactor fl_change")
> Reported-by: John Hurley <john.hurley@netronome.com>
> Signed-off-by: Vlad Buslov <vladbu@mellanox.com>

How is re-offload consistency guaranteed?  IIUC the code is:

 insert into HT
 offload
 insert into IDR

What guarantees re-offload consistency if a new callback is added just
after offload is requested but before the rule ends up in the IDR?
Vlad Buslov April 9, 2019, 8:23 a.m. UTC | #4
On Tue 09 Apr 2019 at 01:26, Jakub Kicinski <jakub.kicinski@netronome.com> wrote:
> On Fri,  5 Apr 2019 20:56:26 +0300, Vlad Buslov wrote:
>> John reports:
>>
>> Recent refactoring of fl_change aims to use the classifier spinlock to
>> avoid the need for the rtnl lock. In doing so, the fl_hw_replace_filter()
>> function was moved to before the lock is taken. This can create problems
>> for drivers if duplicate filters are created (common in ovs tc offload
>> due to filters being triggered by user-space matches).
>>
>> Drivers registered for such filters will now receive multiple copies of
>> the same rule, each with a different cookie value. This means that the
>> drivers would need to do a full match field lookup to determine
>> duplicates, repeating work that already happens in flower __fl_lookup().
>> Currently, drivers do not expect to receive duplicate filters.
>>
>> To fix this, verify that a filter with the same key is not present in the
>> flower classifier hash table, and insert the new filter into the hash
>> table before offloading it to hardware. Implement the helper function
>> fl_ht_insert_unique() to atomically verify/insert a filter.
>>
>> This change makes the filter visible to the fast path at the beginning of
>> the fl_change() function, which means it can no longer be freed directly
>> in case of error. Refactor the fl_change() error handling code to
>> deallocate the filter after an RCU grace period.
>>
>> Fixes: 620da4860827 ("net: sched: flower: refactor fl_change")
>> Reported-by: John Hurley <john.hurley@netronome.com>
>> Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
>
> How is re-offload consistency guaranteed?  IIUC the code is:
>
>  insert into HT
>  offload
>  insert into IDR
>
> What guarantees re-offload consistency if a new callback is added just
> after offload is requested but before the rule ends up in the IDR?

Hi Jakub,

At the moment the cls hardware offloads API is always called with the rtnl
lock taken, so a rule can't be offloaded while a reoffload is in progress.

For my next patch set that unlocks the offloads API I implemented an
algorithm to track the reoffload count for each tp. It works like this
(rough sketch below):

1. struct tcf_proto is extended with a reoffload_count counter that is
   incremented each time reoffload is called on a particular tp instance.
   The counter is protected by tp->lock.

2. struct cls_fl_filter is also extended with a reoffload_count counter.
   Its value is set to the current tp->reoffload_count when offloading
   the filter.

3. After offloading the filter, but before inserting it into the idr,
   f->reoffload_count is compared with tp->reoffload_count. If the
   values don't match, the filter is deleted and -EAGAIN is returned.
   The cls API retries filter insertion on -EAGAIN.
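
Roughly, in pseudo-kernel-C (an illustrative sketch only; the error labels
are borrowed from cls_flower, and the patch set is not posted yet):

    spin_lock(&tp->lock);
    fnew->reoffload_count = tp->reoffload_count; /* snapshot the counter */
    spin_unlock(&tp->lock);

    if (!tc_skip_hw(fnew->flags)) {
        err = fl_hw_replace_filter(tp, fnew, rtnl_held, extack);
        if (err)
            goto errout_ht;
    }

    spin_lock(&tp->lock);
    if (fnew->reoffload_count != tp->reoffload_count) {
        /* A reoffload ran while the filter was being offloaded, so the
         * new callback may not have seen this filter. Undo everything
         * and let the cls API retry the whole insertion.
         */
        spin_unlock(&tp->lock);
        err = -EAGAIN;
        goto errout_hw;
    }
    /* ... insert into the handle idr while still holding tp->lock ... */
    spin_unlock(&tp->lock);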

Regards,
Vlad
Jakub Kicinski April 9, 2019, 5:10 p.m. UTC | #5
On Tue, 9 Apr 2019 08:23:40 +0000, Vlad Buslov wrote:
> On Tue 09 Apr 2019 at 01:26, Jakub Kicinski <jakub.kicinski@netronome.com> wrote:
> > On Fri,  5 Apr 2019 20:56:26 +0300, Vlad Buslov wrote:  
> >> John reports:
> >>
> >> Recent refactoring of fl_change aims to use the classifier spinlock to
> >> avoid the need for the rtnl lock. In doing so, the fl_hw_replace_filter()
> >> function was moved to before the lock is taken. This can create problems
> >> for drivers if duplicate filters are created (common in ovs tc offload
> >> due to filters being triggered by user-space matches).
> >>
> >> Drivers registered for such filters will now receive multiple copies of
> >> the same rule, each with a different cookie value. This means that the
> >> drivers would need to do a full match field lookup to determine
> >> duplicates, repeating work that already happens in flower __fl_lookup().
> >> Currently, drivers do not expect to receive duplicate filters.
> >>
> >> To fix this, verify that a filter with the same key is not present in the
> >> flower classifier hash table, and insert the new filter into the hash
> >> table before offloading it to hardware. Implement the helper function
> >> fl_ht_insert_unique() to atomically verify/insert a filter.
> >>
> >> This change makes the filter visible to the fast path at the beginning of
> >> the fl_change() function, which means it can no longer be freed directly
> >> in case of error. Refactor the fl_change() error handling code to
> >> deallocate the filter after an RCU grace period.
> >>
> >> Fixes: 620da4860827 ("net: sched: flower: refactor fl_change")
> >> Reported-by: John Hurley <john.hurley@netronome.com>
> >> Signed-off-by: Vlad Buslov <vladbu@mellanox.com>  
> >
> > How is re-offload consistency guaranteed?  IIUC the code is:
> >
> >  insert into HT
> >  offload
> >  insert into IDR
> >
> > What guarantees re-offload consistency if a new callback is added just
> > after offload is requested but before the rule ends up in the IDR?
> 
> Hi Jakub,
> 
> At the moment the cls hardware offloads API is always called with the rtnl
> lock taken, so a rule can't be offloaded while a reoffload is in progress.

Does that somehow imply atomicity of offloading vs inserting into IDR?
Doesn't seem so from a cursory look.  Or do you mean rtnl_held is
always true?

> For my next patch set that unlocks the offloads API I implemented an
> algorithm to track the reoffload count for each tp. It works like this:
> 
> 1. struct tcf_proto is extended with a reoffload_count counter that is
>    incremented each time reoffload is called on a particular tp instance.
>    The counter is protected by tp->lock.
> 
> 2. struct cls_fl_filter is also extended with a reoffload_count counter.
>    Its value is set to the current tp->reoffload_count when offloading
>    the filter.
> 
> 3. After offloading the filter, but before inserting it into the idr,
>    f->reoffload_count is compared with tp->reoffload_count. If the
>    values don't match, the filter is deleted and -EAGAIN is returned.
>    The cls API retries filter insertion on -EAGAIN.

Sounds good for add.  Does this solve the delete case as well?

   CPU 0                       CPU 1

__fl_delete
  IDR remove
                           cb unregister
                             hw delete all flows  <- doesn't see the
                                                     remove in progress

  hw delete  <- doesn't see 
                the removed cb
Vlad Buslov April 10, 2019, 2:53 p.m. UTC | #6
On Tue 09 Apr 2019 at 20:10, Jakub Kicinski <jakub.kicinski@netronome.com> wrote:
> On Tue, 9 Apr 2019 08:23:40 +0000, Vlad Buslov wrote:
>> On Tue 09 Apr 2019 at 01:26, Jakub Kicinski <jakub.kicinski@netronome.com> wrote:
>> > On Fri,  5 Apr 2019 20:56:26 +0300, Vlad Buslov wrote:
>> >> John reports:
>> >>
>> >> Recent refactoring of fl_change aims to use the classifier spinlock to
>> >> avoid the need for the rtnl lock. In doing so, the fl_hw_replace_filter()
>> >> function was moved to before the lock is taken. This can create problems
>> >> for drivers if duplicate filters are created (common in ovs tc offload
>> >> due to filters being triggered by user-space matches).
>> >>
>> >> Drivers registered for such filters will now receive multiple copies of
>> >> the same rule, each with a different cookie value. This means that the
>> >> drivers would need to do a full match field lookup to determine
>> >> duplicates, repeating work that already happens in flower __fl_lookup().
>> >> Currently, drivers do not expect to receive duplicate filters.
>> >>
>> >> To fix this, verify that a filter with the same key is not present in the
>> >> flower classifier hash table, and insert the new filter into the hash
>> >> table before offloading it to hardware. Implement the helper function
>> >> fl_ht_insert_unique() to atomically verify/insert a filter.
>> >>
>> >> This change makes the filter visible to the fast path at the beginning of
>> >> the fl_change() function, which means it can no longer be freed directly
>> >> in case of error. Refactor the fl_change() error handling code to
>> >> deallocate the filter after an RCU grace period.
>> >>
>> >> Fixes: 620da4860827 ("net: sched: flower: refactor fl_change")
>> >> Reported-by: John Hurley <john.hurley@netronome.com>
>> >> Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
>> >
>> > How is re-offload consistency guaranteed?  IIUC the code is:
>> >
>> >  insert into HT
>> >  offload
>> >  insert into IDR
>> >
>> > What guarantees re-offload consistency if a new callback is added just
>> > after offload is requested but before the rule ends up in the IDR?
>>
>> Hi Jakub,
>>
>> At the moment the cls hardware offloads API is always called with the rtnl
>> lock taken, so a rule can't be offloaded while a reoffload is in progress.
>
> Does that somehow imply atomicity of offloading vs inserting into IDR?
> Doesn't seem so from a cursory look.  Or do you mean rtnl_held is
> always true?

Sorry, I forgot that we are discussing the shared block case, for which
rtnl is not taken in tc_new_tfilter(). Now I understand the issue and will
send my 'reoffload_count' implementation as a fix.

>
>> For my next patch set that unlocks the offloads API I implemented an
>> algorithm to track the reoffload count for each tp. It works like this:
>>
>> 1. struct tcf_proto is extended with a reoffload_count counter that is
>>    incremented each time reoffload is called on a particular tp instance.
>>    The counter is protected by tp->lock.
>>
>> 2. struct cls_fl_filter is also extended with a reoffload_count counter.
>>    Its value is set to the current tp->reoffload_count when offloading
>>    the filter.
>>
>> 3. After offloading the filter, but before inserting it into the idr,
>>    f->reoffload_count is compared with tp->reoffload_count. If the
>>    values don't match, the filter is deleted and -EAGAIN is returned.
>>    The cls API retries filter insertion on -EAGAIN.
>
> Sounds good for add.  Does this solve the delete case as well?
>
>    CPU 0                       CPU 1
>
> __fl_delete
>   IDR remove
>                            cb unregister
>                              hw delete all flows  <- doesn't see the
>                                                      remove in progress
>
>   hw delete  <- doesn't see
>                 the removed cb

Thanks for pointing that out! Looks like I need to move the call to hw
delete in the __fl_delete() function to be executed before idr removal.
Jakub Kicinski April 10, 2019, 3:48 p.m. UTC | #7
On Wed, 10 Apr 2019 14:53:53 +0000, Vlad Buslov wrote:
> >> For my next patch set that unlocks the offloads API I implemented an
> >> algorithm to track the reoffload count for each tp. It works like this:
> >>
> >> 1. struct tcf_proto is extended with a reoffload_count counter that is
> >>    incremented each time reoffload is called on a particular tp instance.
> >>    The counter is protected by tp->lock.
> >>
> >> 2. struct cls_fl_filter is also extended with a reoffload_count counter.
> >>    Its value is set to the current tp->reoffload_count when offloading
> >>    the filter.
> >>
> >> 3. After offloading the filter, but before inserting it into the idr,
> >>    f->reoffload_count is compared with tp->reoffload_count. If the
> >>    values don't match, the filter is deleted and -EAGAIN is returned.
> >>    The cls API retries filter insertion on -EAGAIN.
> >
> > Sounds good for add.  Does this solve the delete case as well?
> >
> >    CPU 0                       CPU 1
> >
> > __fl_delete
> >   IDR remove
> >                            cb unregister
> >                              hw delete all flows  <- doesn't see the
> >                                                      remove in progress
> >
> >   hw delete  <- doesn't see
> >                 the removed cb  
> 
> Thanks for pointing that out! Looks like I need to move the call to hw
> delete in the __fl_delete() function to be executed before idr removal.

Ack, plus you need to do the same retry mechanism.  Save CB "count"/seq,
hw delete, remove from IDR, if CB "count"/seq changed hw delete again.
Right?
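
I.e. something like this (illustrative only):

    seq = tp->reoffload_count;                    /* snapshot, under tp->lock */
    fl_hw_destroy_filter(tp, f, rtnl_held, NULL); /* hw delete first... */
    idr_remove(&head->handle_idr, f->handle);     /* ...then remove from IDR */
    if (seq != tp->reoffload_count)               /* a reoffload raced with us */
        fl_hw_destroy_filter(tp, f, rtnl_held, NULL); /* hw delete again */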
Vlad Buslov April 10, 2019, 4:02 p.m. UTC | #8
On Wed 10 Apr 2019 at 18:48, Jakub Kicinski <jakub.kicinski@netronome.com> wrote:
> On Wed, 10 Apr 2019 14:53:53 +0000, Vlad Buslov wrote:
>> >> For my next patch set that unlocks the offloads API I implemented an
>> >> algorithm to track the reoffload count for each tp. It works like this:
>> >>
>> >> 1. struct tcf_proto is extended with a reoffload_count counter that is
>> >>    incremented each time reoffload is called on a particular tp instance.
>> >>    The counter is protected by tp->lock.
>> >>
>> >> 2. struct cls_fl_filter is also extended with a reoffload_count counter.
>> >>    Its value is set to the current tp->reoffload_count when offloading
>> >>    the filter.
>> >>
>> >> 3. After offloading the filter, but before inserting it into the idr,
>> >>    f->reoffload_count is compared with tp->reoffload_count. If the
>> >>    values don't match, the filter is deleted and -EAGAIN is returned.
>> >>    The cls API retries filter insertion on -EAGAIN.
>> >
>> > Sounds good for add.  Does this solve the delete case as well?
>> >
>> >    CPU 0                       CPU 1
>> >
>> > __fl_delete
>> >   IDR remove
>> >                            cb unregister
>> >                              hw delete all flows  <- doesn't see the
>> >                                                      remove in progress
>> >
>> >   hw delete  <- doesn't see
>> >                 the removed cb
>>
>> Thanks for pointing that out! Looks like I need to move the call to hw
>> delete in the __fl_delete() function to be executed before idr removal.
>
> Ack, plus you need to do the same retry mechanism.  Save CB "count"/seq,
> hw delete, remove from IDR, if CB "count"/seq changed hw delete again.
> Right?

Actually, I intended to modify fl_reoffload() to ignore filters with the
'deleted' flag set when adding, but I guess reusing 'reoffload_count' to
retry fl_hw_destroy_filter() would also work.
Jakub Kicinski April 10, 2019, 4:09 p.m. UTC | #9
On Wed, 10 Apr 2019 16:02:17 +0000, Vlad Buslov wrote:
> On Wed 10 Apr 2019 at 18:48, Jakub Kicinski <jakub.kicinski@netronome.com> wrote:
> > On Wed, 10 Apr 2019 14:53:53 +0000, Vlad Buslov wrote:  
> >> >> For my next patch set that unlocks the offloads API I implemented an
> >> >> algorithm to track the reoffload count for each tp. It works like this:
> >> >>
> >> >> 1. struct tcf_proto is extended with a reoffload_count counter that is
> >> >>    incremented each time reoffload is called on a particular tp instance.
> >> >>    The counter is protected by tp->lock.
> >> >>
> >> >> 2. struct cls_fl_filter is also extended with a reoffload_count counter.
> >> >>    Its value is set to the current tp->reoffload_count when offloading
> >> >>    the filter.
> >> >>
> >> >> 3. After offloading the filter, but before inserting it into the idr,
> >> >>    f->reoffload_count is compared with tp->reoffload_count. If the
> >> >>    values don't match, the filter is deleted and -EAGAIN is returned.
> >> >>    The cls API retries filter insertion on -EAGAIN.
> >> >
> >> > Sounds good for add.  Does this solve the delete case as well?
> >> >
> >> >    CPU 0                       CPU 1
> >> >
> >> > __fl_delete
> >> >   IDR remove
> >> >                            cb unregister
> >> >                              hw delete all flows  <- doesn't see the
> >> >                                                      remove in progress
> >> >
> >> >   hw delete  <- doesn't see
> >> >                 the removed cb  
> >>
> >> Thanks for pointing that out! Looks like I need to move the call to hw
> >> delete in the __fl_delete() function to be executed before idr removal.
> >
> > Ack, plus you need to do the same retry mechanism.  Save CB "count"/seq,
> > hw delete, remove from IDR, if CB "count"/seq changed hw delete again.
> > Right?  
> 
> Actually, I intended to modify fl_reoffload() to ignore filters with the
> 'deleted' flag set when adding, but I guess reusing 'reoffload_count' to
> retry fl_hw_destroy_filter() would also work.

Yeah, I don't see how you can ignore deleted safely.  Perhaps lack of
coffee :)
Vlad Buslov April 10, 2019, 4:26 p.m. UTC | #10
On Wed 10 Apr 2019 at 19:09, Jakub Kicinski <jakub.kicinski@netronome.com> wrote:
> On Wed, 10 Apr 2019 16:02:17 +0000, Vlad Buslov wrote:
>> On Wed 10 Apr 2019 at 18:48, Jakub Kicinski <jakub.kicinski@netronome.com> wrote:
>> > On Wed, 10 Apr 2019 14:53:53 +0000, Vlad Buslov wrote:
>> >> >> For my next patch set that unlocks the offloads API I implemented an
>> >> >> algorithm to track the reoffload count for each tp. It works like this:
>> >> >>
>> >> >> 1. struct tcf_proto is extended with a reoffload_count counter that is
>> >> >>    incremented each time reoffload is called on a particular tp instance.
>> >> >>    The counter is protected by tp->lock.
>> >> >>
>> >> >> 2. struct cls_fl_filter is also extended with a reoffload_count counter.
>> >> >>    Its value is set to the current tp->reoffload_count when offloading
>> >> >>    the filter.
>> >> >>
>> >> >> 3. After offloading the filter, but before inserting it into the idr,
>> >> >>    f->reoffload_count is compared with tp->reoffload_count. If the
>> >> >>    values don't match, the filter is deleted and -EAGAIN is returned.
>> >> >>    The cls API retries filter insertion on -EAGAIN.
>> >> >
>> >> > Sounds good for add.  Does this solve the delete case as well?
>> >> >
>> >> >    CPU 0                       CPU 1
>> >> >
>> >> > __fl_delete
>> >> >   IDR remove
>> >> >                            cb unregister
>> >> >                              hw delete all flows  <- doesn't see the
>> >> >                                                      remove in progress
>> >> >
>> >> >   hw delete  <- doesn't see
>> >> >                 the removed cb
>> >>
>> >> Thanks for pointing that out! Looks like I need to move the call to hw
>> >> delete in the __fl_delete() function to be executed before idr removal.
>> >
>> > Ack, plus you need to do the same retry mechanism.  Save CB "count"/seq,
>> > hw delete, remove from IDR, if CB "count"/seq changed hw delete again.
>> > Right?
>>
>> Actually, I intended to modify fl_reoffload() to ignore filters with the
>> 'deleted' flag set when adding, but I guess reusing 'reoffload_count' to
>> retry fl_hw_destroy_filter() would also work.
>
> Yeah, I don't see how you can ignore deleted safely.  Perhaps lack of
> coffee :)

Well, drivers are supposed to account for double deletion, or deletion of
filters that were not successfully offloaded to them. If a filter is not
marked as skip_sw, its creation will succeed even if the hw callbacks have
failed, but __fl_delete() still calls fl_hw_destroy_filter() on such
filters. The main thing is that we must guarantee that the code doesn't
delete a new filter with the same key. However, in the case of the flower
classifier the 'cookie' is a pointer to the filter, and the filter is
freed only when the last reference to it is released, so the code is safe
in this regard.

So I guess there is nothing wrong with reoffload calling cb() on all
classifier filters (including those marked as 'deleted'), as long as the
delete code doesn't miss any of the callbacks afterwards.
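
For example, a driver delete callback is expected to treat an unknown
cookie as a no-op. A hypothetical driver (all foo_* names are made up)
would do something like:

    /* Delete must be idempotent: the cookie may never have been offloaded. */
    static int foo_flower_destroy(struct foo_priv *priv,
                                  struct tc_cls_flower_offload *f)
    {
        struct foo_flow *flow;

        flow = foo_flow_lookup(priv, f->cookie);
        if (!flow)
            return 0; /* never offloaded, or already deleted */

        foo_flow_remove_hw(priv, flow);
        kfree(flow);
        return 0;
    }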
Jakub Kicinski April 10, 2019, 5 p.m. UTC | #11
On Wed, 10 Apr 2019 16:26:38 +0000, Vlad Buslov wrote:
> >> Actually, I intended to modify fl_reoffload() to ignore filters with the
> >> 'deleted' flag set when adding, but I guess reusing 'reoffload_count' to
> >> retry fl_hw_destroy_filter() would also work.
> >
> > Yeah, I don't see how you can ignore deleted safely.  Perhaps lack of
> > coffee :)  
> 
> Well, drivers are supposed to account for double deletion, or deletion of
> filters that were not successfully offloaded to them. If a filter is not
> marked as skip_sw, its creation will succeed even if the hw callbacks have
> failed, but __fl_delete() still calls fl_hw_destroy_filter() on such
> filters. The main thing is that we must guarantee that the code doesn't
> delete a new filter with the same key. However, in the case of the flower
> classifier the 'cookie' is a pointer to the filter, and the filter is
> freed only when the last reference to it is released, so the code is safe
> in this regard.
> 
> So I guess there is nothing wrong with reoffload calling cb() on all
> classifier filters (including those marked as 'deleted'), as long as the
> delete code doesn't miss any of the callbacks afterwards.

Yup.  Plus multiple/spurious deletes are already a fact of life, since
we don't keep track of which callback accepted the filter and which
didn't.
Ido Schimmel April 11, 2019, 11:13 a.m. UTC | #12
On Fri, Apr 05, 2019 at 08:56:26PM +0300, Vlad Buslov wrote:
> John reports:
> 
> Recent refactoring of fl_change aims to use the classifier spinlock to
> avoid the need for the rtnl lock. In doing so, the fl_hw_replace_filter()
> function was moved to before the lock is taken. This can create problems
> for drivers if duplicate filters are created (common in ovs tc offload
> due to filters being triggered by user-space matches).
> 
> Drivers registered for such filters will now receive multiple copies of
> the same rule, each with a different cookie value. This means that the
> drivers would need to do a full match field lookup to determine
> duplicates, repeating work that already happens in flower __fl_lookup().
> Currently, drivers do not expect to receive duplicate filters.
> 
> To fix this, verify that a filter with the same key is not present in the
> flower classifier hash table, and insert the new filter into the hash
> table before offloading it to hardware. Implement the helper function
> fl_ht_insert_unique() to atomically verify/insert a filter.
> 
> This change makes the filter visible to the fast path at the beginning of
> the fl_change() function, which means it can no longer be freed directly
> in case of error. Refactor the fl_change() error handling code to
> deallocate the filter after an RCU grace period.
> 
> Fixes: 620da4860827 ("net: sched: flower: refactor fl_change")
> Reported-by: John Hurley <john.hurley@netronome.com>
> Signed-off-by: Vlad Buslov <vladbu@mellanox.com>

Vlad,

Our regression machines all hit a NULL pointer dereference [1], which I
bisected to this patch. I created this reproducer that you can use:

ip netns add ns1
ip -n ns1 link add dev dummy1 type dummy
tc -n ns1 qdisc add dev dummy1 clsact
tc -n ns1 filter add dev dummy1 ingress pref 1 proto ip \
        flower skip_hw src_ip 192.0.2.1 action drop
ip netns del ns1

Can you please look into this? Thanks

[1]
[    5.332176] BUG: unable to handle kernel NULL pointer dereference at 0000000000000004
[    5.334372] #PF error: [normal kernel read fault]
[    5.335619] PGD 0 P4D 0
[    5.336360] Oops: 0000 [#1] SMP
[    5.337249] CPU: 0 PID: 7 Comm: kworker/u2:0 Not tainted 5.1.0-rc4-custom-01473-g526bb57a6ad6 #1374
[    5.339232] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20180531_142017-buildhw-08.phx2.fedoraproject.org-1.fc28 04/01/2014
[    5.341982] Workqueue: netns cleanup_net
[    5.342843] RIP: 0010:__fl_put+0x24/0xb0
[    5.343808] Code: 84 00 00 00 00 00 3e ff 8f f0 03 00 00 0f 88 da 7b 14 00 74 01 c3 80 bf f4 03 00 00 00 0f 84 83 00 00 00 4c 8b 87 c8 01 00 00 <41> 8b 50 04 49 8d 70 04
85 d2 74 60 8d 4a 01 39 ca 7f 52 81 fa fe
[    5.348099] RSP: 0018:ffffabe300663be0 EFLAGS: 00010202
[    5.349223] RAX: ffff9ea4ba1aff00 RBX: ffff9ea4b99af400 RCX: ffffabe300663c67
[    5.350572] RDX: 00000000000004c5 RSI: 0000000000000000 RDI: ffff9ea4b99af400
[    5.351919] RBP: ffff9ea4ba28e900 R08: 0000000000000000 R09: ffffffff9d1b0075
[    5.353272] R10: ffffeb2884e66b80 R11: ffffffff9dc4dcd8 R12: ffff9ea4b99af408
[    5.354635] R13: ffff9ea4b99ae400 R14: ffff9ea4b9a47800 R15: ffff9ea4b99ae000
[    5.355988] FS:  0000000000000000(0000) GS:ffff9ea4bba00000(0000) knlGS:0000000000000000
[    5.357436] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    5.358530] CR2: 0000000000000004 CR3: 00000001398fa004 CR4: 0000000000160ef0
[    5.359876] Call Trace:
[    5.360360]  __fl_delete+0x223/0x3b0
[    5.361008]  fl_destroy+0xb4/0x130
[    5.361641]  tcf_proto_destroy+0x15/0x40
[    5.362429]  tcf_chain_flush+0x4e/0x60
[    5.363125]  __tcf_block_put+0xb4/0x150
[    5.363805]  clsact_destroy+0x30/0x40
[    5.364507]  qdisc_destroy+0x44/0x110
[    5.365218]  dev_shutdown+0x6e/0xa0
[    5.365821]  rollback_registered_many+0x25d/0x510
[    5.366724]  ? netdev_run_todo+0x221/0x280
[    5.367485]  unregister_netdevice_many+0x15/0xa0
[    5.368355]  default_device_exit_batch+0x13f/0x170
[    5.369268]  ? wait_woken+0x80/0x80
[    5.369910]  cleanup_net+0x19a/0x280
[    5.370558]  process_one_work+0x1f5/0x3f0
[    5.371326]  worker_thread+0x28/0x3c0
[    5.372038]  ? process_one_work+0x3f0/0x3f0
[    5.372755]  kthread+0x10d/0x130
[    5.373358]  ? __kthread_create_on_node+0x180/0x180
[    5.374298]  ret_from_fork+0x35/0x40
[    5.374934] CR2: 0000000000000004
[    5.375454] ---[ end trace c20e7f74127772e5 ]---
[    5.376284] RIP: 0010:__fl_put+0x24/0xb0
[    5.377003] Code: 84 00 00 00 00 00 3e ff 8f f0 03 00 00 0f 88 da 7b 14 00 74 01 c3 80 bf f4 03 00 00 00 0f 84 83 00 00 00 4c 8b 87 c8 01 00 00 <41> 8b 50 04 49 8d 70 04
85 d2 74 60 8d 4a 01 39 ca 7f 52 81 fa fe
[    5.380269] RSP: 0018:ffffabe300663be0 EFLAGS: 00010202
[    5.381237] RAX: ffff9ea4ba1aff00 RBX: ffff9ea4b99af400 RCX: ffffabe300663c67
[    5.382551] RDX: 00000000000004c5 RSI: 0000000000000000 RDI: ffff9ea4b99af400
[    5.383972] RBP: ffff9ea4ba28e900 R08: 0000000000000000 R09: ffffffff9d1b0075
[    5.385314] R10: ffffeb2884e66b80 R11: ffffffff9dc4dcd8 R12: ffff9ea4b99af408
[    5.386616] R13: ffff9ea4b99ae400 R14: ffff9ea4b9a47800 R15: ffff9ea4b99ae000
[    5.387986] FS:  0000000000000000(0000) GS:ffff9ea4bba00000(0000) knlGS:0000000000000000
[    5.389512] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    5.390546] CR2: 0000000000000004 CR3: 00000001398fa004 CR4: 0000000000160ef0
Vlad Buslov April 11, 2019, 11:28 a.m. UTC | #13
On Thu 11 Apr 2019 at 14:13, Ido Schimmel <idosch@idosch.org> wrote:
> On Fri, Apr 05, 2019 at 08:56:26PM +0300, Vlad Buslov wrote:
>> John reports:
>> 
>> Recent refactoring of fl_change aims to use the classifier spinlock to
>> avoid the need for the rtnl lock. In doing so, the fl_hw_replace_filter()
>> function was moved to before the lock is taken. This can create problems
>> for drivers if duplicate filters are created (common in ovs tc offload
>> due to filters being triggered by user-space matches).
>> 
>> Drivers registered for such filters will now receive multiple copies of
>> the same rule, each with a different cookie value. This means that the
>> drivers would need to do a full match field lookup to determine
>> duplicates, repeating work that already happens in flower __fl_lookup().
>> Currently, drivers do not expect to receive duplicate filters.
>> 
>> To fix this, verify that a filter with the same key is not present in the
>> flower classifier hash table, and insert the new filter into the hash
>> table before offloading it to hardware. Implement the helper function
>> fl_ht_insert_unique() to atomically verify/insert a filter.
>> 
>> This change makes the filter visible to the fast path at the beginning of
>> the fl_change() function, which means it can no longer be freed directly
>> in case of error. Refactor the fl_change() error handling code to
>> deallocate the filter after an RCU grace period.
>> 
>> Fixes: 620da4860827 ("net: sched: flower: refactor fl_change")
>> Reported-by: John Hurley <john.hurley@netronome.com>
>> Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
>
> Vlad,
>
> Our regression machines all hit a NULL pointer dereference [1], which I
> bisected to this patch. I created this reproducer that you can use:
>
> ip netns add ns1
> ip -n ns1 link add dev dummy1 type dummy
> tc -n ns1 qdisc add dev dummy1 clsact
> tc -n ns1 filter add dev dummy1 ingress pref 1 proto ip \
>         flower skip_hw src_ip 192.0.2.1 action drop
> ip netns del ns1
>
> Can you please look into this? Thanks

Will do.

Patch

diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c
index 6050e3caee31..2763176e369c 100644
--- a/net/sched/cls_flower.c
+++ b/net/sched/cls_flower.c
@@ -1459,6 +1459,28 @@  static int fl_set_parms(struct net *net, struct tcf_proto *tp,
 	return 0;
 }
 
+static int fl_ht_insert_unique(struct cls_fl_filter *fnew,
+			       struct cls_fl_filter *fold,
+			       bool *in_ht)
+{
+	struct fl_flow_mask *mask = fnew->mask;
+	int err;
+
+	err = rhashtable_insert_fast(&mask->ht,
+				     &fnew->ht_node,
+				     mask->filter_ht_params);
+	if (err) {
+		*in_ht = false;
+		/* It is okay if filter with same key exists when
+		 * overwriting.
+		 */
+		return fold && err == -EEXIST ? 0 : err;
+	}
+
+	*in_ht = true;
+	return 0;
+}
+
 static int fl_change(struct net *net, struct sk_buff *in_skb,
 		     struct tcf_proto *tp, unsigned long base,
 		     u32 handle, struct nlattr **tca,
@@ -1470,6 +1492,7 @@  static int fl_change(struct net *net, struct sk_buff *in_skb,
 	struct cls_fl_filter *fnew;
 	struct fl_flow_mask *mask;
 	struct nlattr **tb;
+	bool in_ht;
 	int err;
 
 	if (!tca[TCA_OPTIONS]) {
@@ -1528,10 +1551,14 @@  static int fl_change(struct net *net, struct sk_buff *in_skb,
 	if (err)
 		goto errout;
 
+	err = fl_ht_insert_unique(fnew, fold, &in_ht);
+	if (err)
+		goto errout_mask;
+
 	if (!tc_skip_hw(fnew->flags)) {
 		err = fl_hw_replace_filter(tp, fnew, rtnl_held, extack);
 		if (err)
-			goto errout_mask;
+			goto errout_ht;
 	}
 
 	if (!tc_in_hw(fnew->flags))
@@ -1557,10 +1584,17 @@  static int fl_change(struct net *net, struct sk_buff *in_skb,
 
 		fnew->handle = handle;
 
-		err = rhashtable_insert_fast(&fnew->mask->ht, &fnew->ht_node,
-					     fnew->mask->filter_ht_params);
-		if (err)
-			goto errout_hw;
+		if (!in_ht) {
+			struct rhashtable_params params =
+				fnew->mask->filter_ht_params;
+
+			err = rhashtable_insert_fast(&fnew->mask->ht,
+						     &fnew->ht_node,
+						     params);
+			if (err)
+				goto errout_hw;
+			in_ht = true;
+		}
 
 		rhashtable_remove_fast(&fold->mask->ht,
 				       &fold->ht_node,
@@ -1582,11 +1616,6 @@  static int fl_change(struct net *net, struct sk_buff *in_skb,
 		refcount_dec(&fold->refcnt);
 		__fl_put(fold);
 	} else {
-		if (__fl_lookup(fnew->mask, &fnew->mkey)) {
-			err = -EEXIST;
-			goto errout_hw;
-		}
-
 		if (handle) {
 			/* user specifies a handle and it doesn't exist */
 			err = idr_alloc_u32(&head->handle_idr, fnew, &handle,
@@ -1609,12 +1638,6 @@  static int fl_change(struct net *net, struct sk_buff *in_skb,
 			goto errout_hw;
 
 		fnew->handle = handle;
-
-		err = rhashtable_insert_fast(&fnew->mask->ht, &fnew->ht_node,
-					     fnew->mask->filter_ht_params);
-		if (err)
-			goto errout_idr;
-
 		list_add_tail_rcu(&fnew->list, &fnew->mask->filters);
 		spin_unlock(&tp->lock);
 	}
@@ -1625,17 +1648,18 @@  static int fl_change(struct net *net, struct sk_buff *in_skb,
 	kfree(mask);
 	return 0;
 
-errout_idr:
-	idr_remove(&head->handle_idr, fnew->handle);
 errout_hw:
 	spin_unlock(&tp->lock);
 	if (!tc_skip_hw(fnew->flags))
 		fl_hw_destroy_filter(tp, fnew, rtnl_held, NULL);
+errout_ht:
+	if (in_ht)
+		rhashtable_remove_fast(&fnew->mask->ht, &fnew->ht_node,
+				       fnew->mask->filter_ht_params);
 errout_mask:
 	fl_mask_put(head, fnew->mask, true);
 errout:
-	tcf_exts_destroy(&fnew->exts);
-	kfree(fnew);
+	tcf_queue_work(&fnew->rwork, fl_destroy_filter_work);
 errout_tb:
 	kfree(tb);
 errout_mask_alloc: