diff mbox

[net-next] net/sched: cls_flower: verify root pointer before dereferncing it

Message ID 1479824726-62607-1-git-send-email-roid@mellanox.com
State Changes Requested, archived
Delegated to: David Miller
Headers show

Commit Message

Roi Dayan Nov. 22, 2016, 2:25 p.m. UTC
tp->root is being allocated in init() time and kfreed in destroy()
however it is being dereferenced in classify() path.

We could be in classify() path after destroy() was called and thus 
tp->root is null. Verifying if tp->root is null in classify() path 
is enough because it's being freed with kfree_rcu() and classify() 
path is under rcu_read_lock().

Fixes: 1e052be69d04 ("net_sched: destroy proto tp when all filters are gone")
Signed-off-by: Roi Dayan <roid@mellanox.com>
Cc: Cong Wang <cwang@twopensource.com>
---

Hi Cong, all

As stated above, the issue was introduced with commit 1e052be69d04 ("net_sched: destroy 
proto tp when all filters are gone"). This patch provides a fix only for cls_flower where 
I succeeded in reproducing the issue. Cong, if you can/want to come up with a fix that
will be applicable for all the others classifiners, I am fine with that.

Thanks,
Roi


 net/sched/cls_flower.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Jiri Pirko Nov. 22, 2016, 2:48 p.m. UTC | #1
Tue, Nov 22, 2016 at 03:25:26PM CET, roid@mellanox.com wrote:
>tp->root is being allocated in init() time and kfreed in destroy()
>however it is being dereferenced in classify() path.
>
>We could be in classify() path after destroy() was called and thus 
>tp->root is null. Verifying if tp->root is null in classify() path 
>is enough because it's being freed with kfree_rcu() and classify() 
>path is under rcu_read_lock().
>
>Fixes: 1e052be69d04 ("net_sched: destroy proto tp when all filters are gone")
>Signed-off-by: Roi Dayan <roid@mellanox.com>
>Cc: Cong Wang <cwang@twopensource.com>

This is correct

Reviewed-by: Jiri Pirko <jiri@mellanox.com>

The other way to fix this would be to move tp->ops->destroy call to
call_rcu phase. That would require bigger changes though. net-next
perhaps?



>---
>
>Hi Cong, all
>
>As stated above, the issue was introduced with commit 1e052be69d04 ("net_sched: destroy 
>proto tp when all filters are gone"). This patch provides a fix only for cls_flower where 
>I succeeded in reproducing the issue. Cong, if you can/want to come up with a fix that
>will be applicable for all the others classifiners, I am fine with that.
>
>Thanks,
>Roi
>
>
> net/sched/cls_flower.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
>diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c
>index e8dd09a..88a26c4 100644
>--- a/net/sched/cls_flower.c
>+++ b/net/sched/cls_flower.c
>@@ -135,7 +135,7 @@ static int fl_classify(struct sk_buff *skb, const struct tcf_proto *tp,
> 	struct fl_flow_key skb_mkey;
> 	struct ip_tunnel_info *info;
> 
>-	if (!atomic_read(&head->ht.nelems))
>+	if (!head || !atomic_read(&head->ht.nelems))
> 		return -1;
> 
> 	fl_clear_masked_range(&skb_key, &head->mask);
>-- 
>2.7.4
>
David Miller Nov. 22, 2016, 3:37 p.m. UTC | #2
From: Jiri Pirko <jiri@resnulli.us>
Date: Tue, 22 Nov 2016 15:48:44 +0100

> Tue, Nov 22, 2016 at 03:25:26PM CET, roid@mellanox.com wrote:
>>tp->root is being allocated in init() time and kfreed in destroy()
>>however it is being dereferenced in classify() path.
>>
>>We could be in classify() path after destroy() was called and thus 
>>tp->root is null. Verifying if tp->root is null in classify() path 
>>is enough because it's being freed with kfree_rcu() and classify() 
>>path is under rcu_read_lock().
>>
>>Fixes: 1e052be69d04 ("net_sched: destroy proto tp when all filters are gone")
>>Signed-off-by: Roi Dayan <roid@mellanox.com>
>>Cc: Cong Wang <cwang@twopensource.com>
> 
> This is correct
> 
> Reviewed-by: Jiri Pirko <jiri@mellanox.com>
> 
> The other way to fix this would be to move tp->ops->destroy call to
> call_rcu phase. That would require bigger changes though. net-next
> perhaps?

This patch is targetted at net-next as per Subj.
Daniel Borkmann Nov. 22, 2016, 4:04 p.m. UTC | #3
[ + John ]

On 11/22/2016 03:48 PM, Jiri Pirko wrote:
> Tue, Nov 22, 2016 at 03:25:26PM CET, roid@mellanox.com wrote:
>> tp->root is being allocated in init() time and kfreed in destroy()
>> however it is being dereferenced in classify() path.
>>
>> We could be in classify() path after destroy() was called and thus
>> tp->root is null. Verifying if tp->root is null in classify() path
>> is enough because it's being freed with kfree_rcu() and classify()
>> path is under rcu_read_lock().
>>
>> Fixes: 1e052be69d04 ("net_sched: destroy proto tp when all filters are gone")
>> Signed-off-by: Roi Dayan <roid@mellanox.com>
>> Cc: Cong Wang <cwang@twopensource.com>
>
> This is correct
>
> Reviewed-by: Jiri Pirko <jiri@mellanox.com>
>
> The other way to fix this would be to move tp->ops->destroy call to
> call_rcu phase. That would require bigger changes though. net-next
> perhaps?

Hmm, I don't think we want to have such an additional test in fast
path for each and every classifier. Can we think of ways to avoid that?

My question is, since we unlink individual instances from such tp-internal
lists through RCU and release the instance through call_rcu() as well as
the head (tp->root) via kfree_rcu() eventually, against what are we protecting
setting RCU_INIT_POINTER(tp->root, NULL) in ->destroy() callback? Something
not respecting grace period?

The only thing that actually checks if tp->root is NULL right now is the
get() callback. Is that the reason why tp->root is RCU'ified? John?

Thanks,
Daniel

>> Hi Cong, all
>>
>> As stated above, the issue was introduced with commit 1e052be69d04 ("net_sched: destroy
>> proto tp when all filters are gone"). This patch provides a fix only for cls_flower where
>> I succeeded in reproducing the issue. Cong, if you can/want to come up with a fix that
>> will be applicable for all the others classifiners, I am fine with that.
>>
>> Thanks,
>> Roi
>>
>>
>> net/sched/cls_flower.c | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c
>> index e8dd09a..88a26c4 100644
>> --- a/net/sched/cls_flower.c
>> +++ b/net/sched/cls_flower.c
>> @@ -135,7 +135,7 @@ static int fl_classify(struct sk_buff *skb, const struct tcf_proto *tp,
>> 	struct fl_flow_key skb_mkey;
>> 	struct ip_tunnel_info *info;
>>
>> -	if (!atomic_read(&head->ht.nelems))
>> +	if (!head || !atomic_read(&head->ht.nelems))
>> 		return -1;
>>
>> 	fl_clear_masked_range(&skb_key, &head->mask);
>> --
>> 2.7.4
>>
Jiri Pirko Nov. 22, 2016, 4:11 p.m. UTC | #4
Tue, Nov 22, 2016 at 05:04:11PM CET, daniel@iogearbox.net wrote:
>[ + John ]
>
>On 11/22/2016 03:48 PM, Jiri Pirko wrote:
>> Tue, Nov 22, 2016 at 03:25:26PM CET, roid@mellanox.com wrote:
>> > tp->root is being allocated in init() time and kfreed in destroy()
>> > however it is being dereferenced in classify() path.
>> > 
>> > We could be in classify() path after destroy() was called and thus
>> > tp->root is null. Verifying if tp->root is null in classify() path
>> > is enough because it's being freed with kfree_rcu() and classify()
>> > path is under rcu_read_lock().
>> > 
>> > Fixes: 1e052be69d04 ("net_sched: destroy proto tp when all filters are gone")
>> > Signed-off-by: Roi Dayan <roid@mellanox.com>
>> > Cc: Cong Wang <cwang@twopensource.com>
>> 
>> This is correct
>> 
>> Reviewed-by: Jiri Pirko <jiri@mellanox.com>
>> 
>> The other way to fix this would be to move tp->ops->destroy call to
>> call_rcu phase. That would require bigger changes though. net-next
>> perhaps?
>
>Hmm, I don't think we want to have such an additional test in fast
>path for each and every classifier. Can we think of ways to avoid that?
>
>My question is, since we unlink individual instances from such tp-internal
>lists through RCU and release the instance through call_rcu() as well as
>the head (tp->root) via kfree_rcu() eventually, against what are we protecting
>setting RCU_INIT_POINTER(tp->root, NULL) in ->destroy() callback? Something
>not respecting grace period?

If you call tp->ops->destroy in call_rcu, you don't have to set tp->root
to null.


>
>The only thing that actually checks if tp->root is NULL right now is the
>get() callback. Is that the reason why tp->root is RCU'ified? John?
>
>Thanks,
>Daniel
>
>> > Hi Cong, all
>> > 
>> > As stated above, the issue was introduced with commit 1e052be69d04 ("net_sched: destroy
>> > proto tp when all filters are gone"). This patch provides a fix only for cls_flower where
>> > I succeeded in reproducing the issue. Cong, if you can/want to come up with a fix that
>> > will be applicable for all the others classifiners, I am fine with that.
>> > 
>> > Thanks,
>> > Roi
>> > 
>> > 
>> > net/sched/cls_flower.c | 2 +-
>> > 1 file changed, 1 insertion(+), 1 deletion(-)
>> > 
>> > diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c
>> > index e8dd09a..88a26c4 100644
>> > --- a/net/sched/cls_flower.c
>> > +++ b/net/sched/cls_flower.c
>> > @@ -135,7 +135,7 @@ static int fl_classify(struct sk_buff *skb, const struct tcf_proto *tp,
>> > 	struct fl_flow_key skb_mkey;
>> > 	struct ip_tunnel_info *info;
>> > 
>> > -	if (!atomic_read(&head->ht.nelems))
>> > +	if (!head || !atomic_read(&head->ht.nelems))
>> > 		return -1;
>> > 
>> > 	fl_clear_masked_range(&skb_key, &head->mask);
>> > --
>> > 2.7.4
>> > 
>
Jiri Pirko Nov. 22, 2016, 4:13 p.m. UTC | #5
Tue, Nov 22, 2016 at 04:37:42PM CET, davem@davemloft.net wrote:
>From: Jiri Pirko <jiri@resnulli.us>
>Date: Tue, 22 Nov 2016 15:48:44 +0100
>
>> Tue, Nov 22, 2016 at 03:25:26PM CET, roid@mellanox.com wrote:
>>>tp->root is being allocated in init() time and kfreed in destroy()
>>>however it is being dereferenced in classify() path.
>>>
>>>We could be in classify() path after destroy() was called and thus 
>>>tp->root is null. Verifying if tp->root is null in classify() path 
>>>is enough because it's being freed with kfree_rcu() and classify() 
>>>path is under rcu_read_lock().
>>>
>>>Fixes: 1e052be69d04 ("net_sched: destroy proto tp when all filters are gone")
>>>Signed-off-by: Roi Dayan <roid@mellanox.com>
>>>Cc: Cong Wang <cwang@twopensource.com>
>> 
>> This is correct
>> 
>> Reviewed-by: Jiri Pirko <jiri@mellanox.com>
>> 
>> The other way to fix this would be to move tp->ops->destroy call to
>> call_rcu phase. That would require bigger changes though. net-next
>> perhaps?
>
>This patch is targetted at net-next as per Subj.

Oh, right, then it should be fixed so the tp->head could be never null
Cong Wang Nov. 22, 2016, 7:28 p.m. UTC | #6
On Tue, Nov 22, 2016 at 8:11 AM, Jiri Pirko <jiri@resnulli.us> wrote:
> Tue, Nov 22, 2016 at 05:04:11PM CET, daniel@iogearbox.net wrote:
>>Hmm, I don't think we want to have such an additional test in fast
>>path for each and every classifier. Can we think of ways to avoid that?
>>
>>My question is, since we unlink individual instances from such tp-internal
>>lists through RCU and release the instance through call_rcu() as well as
>>the head (tp->root) via kfree_rcu() eventually, against what are we protecting
>>setting RCU_INIT_POINTER(tp->root, NULL) in ->destroy() callback? Something
>>not respecting grace period?
>
> If you call tp->ops->destroy in call_rcu, you don't have to set tp->root
> to null.

We do need to respect the grace period if we touch the globally visible
data structure tp in tcf_destroy(). Therefore Roi's patch is not fixing the
right place.

Also I don't know why you blame my commit, this problem should already
exist prior to my commit, probably date back to John's RCU patches.

I am working on a patch.
Daniel Borkmann Nov. 22, 2016, 8:41 p.m. UTC | #7
On 11/22/2016 08:28 PM, Cong Wang wrote:
> On Tue, Nov 22, 2016 at 8:11 AM, Jiri Pirko <jiri@resnulli.us> wrote:
>> Tue, Nov 22, 2016 at 05:04:11PM CET, daniel@iogearbox.net wrote:
>>> Hmm, I don't think we want to have such an additional test in fast
>>> path for each and every classifier. Can we think of ways to avoid that?
>>>
>>> My question is, since we unlink individual instances from such tp-internal
>>> lists through RCU and release the instance through call_rcu() as well as
>>> the head (tp->root) via kfree_rcu() eventually, against what are we protecting
>>> setting RCU_INIT_POINTER(tp->root, NULL) in ->destroy() callback? Something
>>> not respecting grace period?
>>
>> If you call tp->ops->destroy in call_rcu, you don't have to set tp->root
>> to null.

But that's not really an answer to my question. ;)

> We do need to respect the grace period if we touch the globally visible
> data structure tp in tcf_destroy(). Therefore Roi's patch is not fixing the
> right place.

I think there may be multiple issues actually.

At the time we go into tc_classify(), from ingress as well as egress side,
we're under RCU, but BH variant. In cls delete()/destroy() callbacks, we
everywhere use call_rcu() and kfree_rcu(), same as for tcf_destroy() where
we use kfree_rcu() on tp, although we iterate tps (and implicitly inner filters)
via rcu_dereference_bh() from reader side. Is there a reason why we don't
use call_rcu_bh() variant on destruction for all this instead?

Just looking at cls_bpf and others, what protects RCU_INIT_POINTER(tp->root,
NULL) against? The tp is unlinked in tc_ctl_tfilter() from the tp chain in
tcf_destroy() cases. Still active readers under RCU BH can race against this
(tp->root being NULL), as the commit identified. Only the get() callback checks
for head against NULL, but both are serialized under rtnl, and the only place
we call this is tc_ctl_tfilter(). Even if we create a new tp, head should not
be NULL there, if it was assigned during the init() cb, but contains an empty
list. (It's different for things like cls_cgroup, though.) So, I'm wondering
if the RCU_INIT_POINTER(tp->root, NULL) can just be removed instead (unless I'm
missing something obvious)?

> Also I don't know why you blame my commit, this problem should already
> exist prior to my commit, probably date back to John's RCU patches.

It seems so.
John Fastabend Nov. 22, 2016, 11:36 p.m. UTC | #8
On 16-11-22 12:41 PM, Daniel Borkmann wrote:
> On 11/22/2016 08:28 PM, Cong Wang wrote:
>> On Tue, Nov 22, 2016 at 8:11 AM, Jiri Pirko <jiri@resnulli.us> wrote:
>>> Tue, Nov 22, 2016 at 05:04:11PM CET, daniel@iogearbox.net wrote:
>>>> Hmm, I don't think we want to have such an additional test in fast
>>>> path for each and every classifier. Can we think of ways to avoid that?
>>>>
>>>> My question is, since we unlink individual instances from such
>>>> tp-internal
>>>> lists through RCU and release the instance through call_rcu() as
>>>> well as
>>>> the head (tp->root) via kfree_rcu() eventually, against what are we
>>>> protecting
>>>> setting RCU_INIT_POINTER(tp->root, NULL) in ->destroy() callback?
>>>> Something
>>>> not respecting grace period?
>>>
>>> If you call tp->ops->destroy in call_rcu, you don't have to set tp->root
>>> to null.
> 
> But that's not really an answer to my question. ;)
> 
>> We do need to respect the grace period if we touch the globally visible
>> data structure tp in tcf_destroy(). Therefore Roi's patch is not
>> fixing the
>> right place.
> 
> I think there may be multiple issues actually.
> 
> At the time we go into tc_classify(), from ingress as well as egress side,
> we're under RCU, but BH variant. In cls delete()/destroy() callbacks, we
> everywhere use call_rcu() and kfree_rcu(), same as for tcf_destroy() where
> we use kfree_rcu() on tp, although we iterate tps (and implicitly inner
> filters)
> via rcu_dereference_bh() from reader side. Is there a reason why we don't
> use call_rcu_bh() variant on destruction for all this instead?

I can't think of any if its all under _bh we can convert the call_rcu to
call_rcu_bh it just needs an audit.

> 
> Just looking at cls_bpf and others, what protects
> RCU_INIT_POINTER(tp->root,
> NULL) against? The tp is unlinked in tc_ctl_tfilter() from the tp chain in
> tcf_destroy() cases. Still active readers under RCU BH can race against
> this
> (tp->root being NULL), as the commit identified. Only the get() callback
> checks
> for head against NULL, but both are serialized under rtnl, and the only
> place
> we call this is tc_ctl_tfilter(). Even if we create a new tp, head
> should not
> be NULL there, if it was assigned during the init() cb, but contains an
> empty
> list. (It's different for things like cls_cgroup, though.) So, I'm
> wondering
> if the RCU_INIT_POINTER(tp->root, NULL) can just be removed instead
> (unless I'm
> missing something obvious)?


Just took a look at this I think there are a couple possible solutions.
The easiest is likely to fix all the call sites so that 'tp' is unlinked
before calling the destroy() handlers AND not doing the NULL set. I only
see one such call site where destroy is called before unlinking at the
moment. This should enforce that after a grace period there is no path
to reach the classifiers because 'tp' is unlinked. Calling destroy
before unlinking 'tp' however could cause a small race between grace
period of 'tp' and grace period of the filter.

Another would be to only call the destroy path from the call_rcu path
of the 'tp' object so that destroy is only ever called after the object
is guaranteed to be unlinked from the tc_filter path.

I think both solutions would be fine.

Cong were you working on one of these? Or do you have another idea?


> 
>> Also I don't know why you blame my commit, this problem should already
>> exist prior to my commit, probably date back to John's RCU patches.
> 
> It seems so.
Cong Wang Nov. 23, 2016, 5:24 a.m. UTC | #9
On Tue, Nov 22, 2016 at 3:36 PM, John Fastabend
<john.fastabend@gmail.com> wrote:
> On 16-11-22 12:41 PM, Daniel Borkmann wrote:
>> On 11/22/2016 08:28 PM, Cong Wang wrote:
>>> On Tue, Nov 22, 2016 at 8:11 AM, Jiri Pirko <jiri@resnulli.us> wrote:
>>>> Tue, Nov 22, 2016 at 05:04:11PM CET, daniel@iogearbox.net wrote:
>>>>> Hmm, I don't think we want to have such an additional test in fast
>>>>> path for each and every classifier. Can we think of ways to avoid that?
>>>>>
>>>>> My question is, since we unlink individual instances from such
>>>>> tp-internal
>>>>> lists through RCU and release the instance through call_rcu() as
>>>>> well as
>>>>> the head (tp->root) via kfree_rcu() eventually, against what are we
>>>>> protecting
>>>>> setting RCU_INIT_POINTER(tp->root, NULL) in ->destroy() callback?
>>>>> Something
>>>>> not respecting grace period?
>>>>
>>>> If you call tp->ops->destroy in call_rcu, you don't have to set tp->root
>>>> to null.
>>
>> But that's not really an answer to my question. ;)
>>
>>> We do need to respect the grace period if we touch the globally visible
>>> data structure tp in tcf_destroy(). Therefore Roi's patch is not
>>> fixing the
>>> right place.
>>
>> I think there may be multiple issues actually.
>>
>> At the time we go into tc_classify(), from ingress as well as egress side,
>> we're under RCU, but BH variant. In cls delete()/destroy() callbacks, we
>> everywhere use call_rcu() and kfree_rcu(), same as for tcf_destroy() where
>> we use kfree_rcu() on tp, although we iterate tps (and implicitly inner
>> filters)
>> via rcu_dereference_bh() from reader side. Is there a reason why we don't
>> use call_rcu_bh() variant on destruction for all this instead?
>
> I can't think of any if its all under _bh we can convert the call_rcu to
> call_rcu_bh it just needs an audit.
>
>>
>> Just looking at cls_bpf and others, what protects
>> RCU_INIT_POINTER(tp->root,
>> NULL) against? The tp is unlinked in tc_ctl_tfilter() from the tp chain in
>> tcf_destroy() cases. Still active readers under RCU BH can race against
>> this
>> (tp->root being NULL), as the commit identified. Only the get() callback
>> checks
>> for head against NULL, but both are serialized under rtnl, and the only
>> place
>> we call this is tc_ctl_tfilter(). Even if we create a new tp, head
>> should not
>> be NULL there, if it was assigned during the init() cb, but contains an
>> empty
>> list. (It's different for things like cls_cgroup, though.) So, I'm
>> wondering
>> if the RCU_INIT_POINTER(tp->root, NULL) can just be removed instead
>> (unless I'm
>> missing something obvious)?
>
>
> Just took a look at this I think there are a couple possible solutions.
> The easiest is likely to fix all the call sites so that 'tp' is unlinked
> before calling the destroy() handlers AND not doing the NULL set. I only
> see one such call site where destroy is called before unlinking at the
> moment. This should enforce that after a grace period there is no path
> to reach the classifiers because 'tp' is unlinked. Calling destroy
> before unlinking 'tp' however could cause a small race between grace
> period of 'tp' and grace period of the filter.
>
> Another would be to only call the destroy path from the call_rcu path
> of the 'tp' object so that destroy is only ever called after the object
> is guaranteed to be unlinked from the tc_filter path.
>
> I think both solutions would be fine.
>
> Cong were you working on one of these? Or do you have another idea?

Yeah, this is basic what I think as well, however, both are hard.
On one hand, we can't detach the tp from the global singly-linked list
before tcf_destroy() since we rely on its return value to make this decision.
On the other hand, it is a singly-linked list, we have to pass in the address
of its previous pointer to rcu callback to remove it, it seems racy as well
since we modify a previous pointer which is still visible globally...

Hmm, perhaps we really have to switch to a doubly-linked list, that is
list_head. I need to double check. And also the semantic of ->destroy()
needs to revise too.

So yeah, my commit should be blamed. :-/
Daniel Borkmann Nov. 23, 2016, 11:29 a.m. UTC | #10
On 11/23/2016 06:24 AM, Cong Wang wrote:
> On Tue, Nov 22, 2016 at 3:36 PM, John Fastabend
> <john.fastabend@gmail.com> wrote:
>> On 16-11-22 12:41 PM, Daniel Borkmann wrote:
>>> On 11/22/2016 08:28 PM, Cong Wang wrote:
>>>> On Tue, Nov 22, 2016 at 8:11 AM, Jiri Pirko <jiri@resnulli.us> wrote:
>>>>> Tue, Nov 22, 2016 at 05:04:11PM CET, daniel@iogearbox.net wrote:
>>>>>> Hmm, I don't think we want to have such an additional test in fast
>>>>>> path for each and every classifier. Can we think of ways to avoid that?
>>>>>>
>>>>>> My question is, since we unlink individual instances from such
>>>>>> tp-internal
>>>>>> lists through RCU and release the instance through call_rcu() as
>>>>>> well as
>>>>>> the head (tp->root) via kfree_rcu() eventually, against what are we
>>>>>> protecting
>>>>>> setting RCU_INIT_POINTER(tp->root, NULL) in ->destroy() callback?
>>>>>> Something
>>>>>> not respecting grace period?
>>>>>
>>>>> If you call tp->ops->destroy in call_rcu, you don't have to set tp->root
>>>>> to null.
>>>
>>> But that's not really an answer to my question. ;)
>>>
>>>> We do need to respect the grace period if we touch the globally visible
>>>> data structure tp in tcf_destroy(). Therefore Roi's patch is not
>>>> fixing the
>>>> right place.
>>>
>>> I think there may be multiple issues actually.
>>>
>>> At the time we go into tc_classify(), from ingress as well as egress side,
>>> we're under RCU, but BH variant. In cls delete()/destroy() callbacks, we
>>> everywhere use call_rcu() and kfree_rcu(), same as for tcf_destroy() where
>>> we use kfree_rcu() on tp, although we iterate tps (and implicitly inner
>>> filters)
>>> via rcu_dereference_bh() from reader side. Is there a reason why we don't
>>> use call_rcu_bh() variant on destruction for all this instead?
>>
>> I can't think of any if its all under _bh we can convert the call_rcu to
>> call_rcu_bh it just needs an audit.
>>
>>> Just looking at cls_bpf and others, what protects
>>> RCU_INIT_POINTER(tp->root,
>>> NULL) against? The tp is unlinked in tc_ctl_tfilter() from the tp chain in
>>> tcf_destroy() cases. Still active readers under RCU BH can race against
>>> this
>>> (tp->root being NULL), as the commit identified. Only the get() callback
>>> checks
>>> for head against NULL, but both are serialized under rtnl, and the only
>>> place
>>> we call this is tc_ctl_tfilter(). Even if we create a new tp, head
>>> should not
>>> be NULL there, if it was assigned during the init() cb, but contains an
>>> empty
>>> list. (It's different for things like cls_cgroup, though.) So, I'm
>>> wondering
>>> if the RCU_INIT_POINTER(tp->root, NULL) can just be removed instead
>>> (unless I'm
>>> missing something obvious)?
>>
>> Just took a look at this I think there are a couple possible solutions.
>> The easiest is likely to fix all the call sites so that 'tp' is unlinked
>> before calling the destroy() handlers AND not doing the NULL set. I only
>> see one such call site where destroy is called before unlinking at the
>> moment. This should enforce that after a grace period there is no path
>> to reach the classifiers because 'tp' is unlinked. Calling destroy
>> before unlinking 'tp' however could cause a small race between grace
>> period of 'tp' and grace period of the filter.
>>
>> Another would be to only call the destroy path from the call_rcu path
>> of the 'tp' object so that destroy is only ever called after the object
>> is guaranteed to be unlinked from the tc_filter path.
>>
>> I think both solutions would be fine.
>>
>> Cong were you working on one of these? Or do you have another idea?
>
> Yeah, this is basic what I think as well, however, both are hard.
> On one hand, we can't detach the tp from the global singly-linked list
> before tcf_destroy() since we rely on its return value to make this decision.
> On the other hand, it is a singly-linked list, we have to pass in the address
> of its previous pointer to rcu callback to remove it, it seems racy as well
> since we modify a previous pointer which is still visible globally...

Can't we drop the 'force' parameter from tcf_destroy() and related cls
destroy() callbacks, and change the logic roughly like this:

[...]
         case RTM_DELTFILTER:
                 err = tp->ops->delete(tp, fh, &drop_tp);
                 if (err == 0) {
                         struct tcf_proto *next = rtnl_dereference(tp->next);

                         tfilter_notify(net, skb, n, tp,
                                        t->tcm_handle,
                                        RTM_DELTFILTER, false);
                         if (drop_tp) {
                                 RCU_INIT_POINTER(*back, next);
                                 tcf_destroy(tp);
                         }
                 }
                 goto errout;
[...]

This one was the only tcf_destroy() instance with force=false. Why can't
the prior delete() callback make the decision whether the tp now has no
further internal filters and thus can be dropped. Afaik, delete() and
destroy() are protected by RTNL anyway. Thus, we could unlink the tp from
the list before tcf_destroy(), which should then work with grace period
as well. Given we remove the setting of tp->root to NULL, any outstanding
readers for that grace period should either still execute the 'scheduled
for removal' filter we just dropped, or find an empty list of filters.

> Hmm, perhaps we really have to switch to a doubly-linked list, that is
> list_head. I need to double check. And also the semantic of ->destroy()
> needs to revise too.

Can you elaborate why double-linked list? Isn't the tp list always protected
from modifications via RTNL in control path, and walked via rcu_dereference_bh()
in data path?

> So yeah, my commit should be blamed. :-/
Cong Wang Nov. 23, 2016, 7:29 p.m. UTC | #11
On Wed, Nov 23, 2016 at 3:29 AM, Daniel Borkmann <daniel@iogearbox.net> wrote:
>
> Can't we drop the 'force' parameter from tcf_destroy() and related cls
> destroy() callbacks, and change the logic roughly like this:
>
> [...]
>         case RTM_DELTFILTER:
>                 err = tp->ops->delete(tp, fh, &drop_tp);
>                 if (err == 0) {
>                         struct tcf_proto *next = rtnl_dereference(tp->next);
>
>                         tfilter_notify(net, skb, n, tp,
>                                        t->tcm_handle,
>                                        RTM_DELTFILTER, false);
>                         if (drop_tp) {
>                                 RCU_INIT_POINTER(*back, next);
>                                 tcf_destroy(tp);
>                         }
>                 }
>                 goto errout;
> [...]
>
> This one was the only tcf_destroy() instance with force=false. Why can't
> the prior delete() callback make the decision whether the tp now has no
> further internal filters and thus can be dropped. Afaik, delete() and
> destroy() are protected by RTNL anyway. Thus, we could unlink the tp from
> the list before tcf_destroy(), which should then work with grace period
> as well. Given we remove the setting of tp->root to NULL, any outstanding
> readers for that grace period should either still execute the 'scheduled
> for removal' filter we just dropped, or find an empty list of filters.

This is exactly why I said "the semantic of ->destroy() needs to revise too",
this is a reasonable revision of course, but the change is still large because
we need to move that logic from ->destroy() to ->delete(). I was trying to find
a relatively small fix for -net and -stable, for -net-next we could do
aggressive
change as long as it's necessary. This is why I am still thinking about it,
perhaps there is no quick fix for this bug.


>
>> Hmm, perhaps we really have to switch to a doubly-linked list, that is
>> list_head. I need to double check. And also the semantic of ->destroy()
>> needs to revise too.
>
>
> Can you elaborate why double-linked list? Isn't the tp list always protected
> from modifications via RTNL in control path, and walked via
> rcu_dereference_bh()
> in data path?

At least two benefits we can get from using doubly-linked list:

1) No need to pass a 'prev' pointer if we want to remove tp in a RCU callback,
list_del_rcu(&tp->head) is just enough.

2) No need to worry about RCU pointers because list_head has RCU API's
already, much more readable to me.

Of course, the size of struct tcf_proto will grow a bit, but it doesn't seem to
be a problem.
David Miller Nov. 24, 2016, 8:25 p.m. UTC | #12
From: Cong Wang <xiyou.wangcong@gmail.com>
Date: Tue, 22 Nov 2016 11:28:37 -0800

> On Tue, Nov 22, 2016 at 8:11 AM, Jiri Pirko <jiri@resnulli.us> wrote:
>> Tue, Nov 22, 2016 at 05:04:11PM CET, daniel@iogearbox.net wrote:
>>>Hmm, I don't think we want to have such an additional test in fast
>>>path for each and every classifier. Can we think of ways to avoid that?
>>>
>>>My question is, since we unlink individual instances from such tp-internal
>>>lists through RCU and release the instance through call_rcu() as well as
>>>the head (tp->root) via kfree_rcu() eventually, against what are we protecting
>>>setting RCU_INIT_POINTER(tp->root, NULL) in ->destroy() callback? Something
>>>not respecting grace period?
>>
>> If you call tp->ops->destroy in call_rcu, you don't have to set tp->root
>> to null.
> 
> We do need to respect the grace period if we touch the globally visible
> data structure tp in tcf_destroy(). Therefore Roi's patch is not fixing the
> right place.

Another idea is to assign tp->root to a dummy static cls_fl_head object,
instead of NULL, which we just make sure has an ht.elems value of zero.

This avoids having to touch the fast path and also avoids all of these
complicated changes being discussed wrt. doing things in call_rcu_bh()
or whatever.
diff mbox

Patch

diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c
index e8dd09a..88a26c4 100644
--- a/net/sched/cls_flower.c
+++ b/net/sched/cls_flower.c
@@ -135,7 +135,7 @@  static int fl_classify(struct sk_buff *skb, const struct tcf_proto *tp,
 	struct fl_flow_key skb_mkey;
 	struct ip_tunnel_info *info;
 
-	if (!atomic_read(&head->ht.nelems))
+	if (!head || !atomic_read(&head->ht.nelems))
 		return -1;
 
 	fl_clear_masked_range(&skb_key, &head->mask);