[1/7] fix hnode refcounting

Message ID	20180905190414.5477-1-viro@ZenIV.linux.org.uk
State	Changes Requested, archived
Delegated to:	David Miller
Headers	show Return-Path: <netdev-owner@vger.kernel.org> From: Al Viro <viro@ZenIV.linux.org.uk> To: netdev@vger.kernel.org Cc: Jamal Hadi Salim <jhs@mojatatu.com>, Cong Wang <xiyou.wangcong@gmail.com>, Jiri Pirko <jiri@resnulli.us>, stable@vger.kernel.org Subject: [PATCH 1/7] fix hnode refcounting Date: Wed, 5 Sep 2018 20:04:08 +0100 Message-Id: <20180905190414.5477-1-viro@ZenIV.linux.org.uk> In-Reply-To: <20180905190134.GQ19965@ZenIV.linux.org.uk> References: <20180905190134.GQ19965@ZenIV.linux.org.uk> Sender: netdev-owner@vger.kernel.org Precedence: bulk
Series	[1/7] fix hnode refcounting \| expand [1/7] fix hnode refcounting [2/7] mark root hnode explicitly [3/7] make sure that divisor is a power of 2 [4/7] get rid of unused argument of u32_destroy_key() [5/7] get rid of tc_u_knode ->tp [6/7] get rid of tc_u_common ->rcu [7/7] clean tc_u_common hashtable

Al Viro Sept. 5, 2018, 7:04 p.m. UTC

From: Al Viro <viro@zeniv.linux.org.uk>

cls_u32.c misuses refcounts for struct tc_u_hnode - it counts references via
->hlist and via ->tp_root together.  u32_destroy() drops the former and, in
case when there had been links, leaves the sucker on the list.  As the result,
there's nothing to protect it from getting freed once links are dropped.
That also makes the "is it busy" check incapable of catching the root hnode -
it *is* busy (there's a reference from tp), but we don't see it as something
separate.  "Is it our root?" check partially covers that, but the problem
exists for others' roots as well.

AFAICS, the minimal fix preserving the existing behaviour (where it doesn't
include oopsen, that is) would be this:
        * count tp->root and tp_c->hlist as separate references.  I.e.
have u32_init() set refcount to 2, not 1.
	* in u32_destroy() we always drop the former; in u32_destroy_hnode() -
the latter.

	That way we have *all* references contributing to refcount.  List
removal happens in u32_destroy_hnode() (called only when ->refcnt is 1)
an in u32_destroy() in case of tc_u_common going away, along with everything
reachable from it.  IOW, that way we know that u32_destroy_key() won't
free something still on the list (or pointed to by someone's ->root).

Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 net/sched/cls_u32.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

Jamal Hadi Salim Sept. 6, 2018, 10:21 a.m. UTC | #1

On 2018-09-05 3:04 p.m., Al Viro wrote:
> From: Al Viro <viro@zeniv.linux.org.uk>
> 
> cls_u32.c misuses refcounts for struct tc_u_hnode - it counts references via
> ->hlist and via ->tp_root together.  u32_destroy() drops the former and, in
> case when there had been links, leaves the sucker on the list.  As the result,
> there's nothing to protect it from getting freed once links are dropped.
> That also makes the "is it busy" check incapable of catching the root hnode -
> it *is* busy (there's a reference from tp), but we don't see it as something
> separate.  "Is it our root?" check partially covers that, but the problem
> exists for others' roots as well.
> 
> AFAICS, the minimal fix preserving the existing behaviour (where it doesn't
> include oopsen, that is) would be this:
>          * count tp->root and tp_c->hlist as separate references.  I.e.
> have u32_init() set refcount to 2, not 1.
> 	* in u32_destroy() we always drop the former; in u32_destroy_hnode() -
> the latter.
> 
> 	That way we have *all* references contributing to refcount.  List
> removal happens in u32_destroy_hnode() (called only when ->refcnt is 1)
> an in u32_destroy() in case of tc_u_common going away, along with everything
> reachable from it.  IOW, that way we know that u32_destroy_key() won't
> free something still on the list (or pointed to by someone's ->root).
> 
> Cc: stable@vger.kernel.org
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

For networking patches, subject should be reflective of tree and
subsystem. Example for this one:
"[PATCH net 1/7]:net: sched: cls_u32: fix hnode refcounting"
Also useful to have a cover letter summarizing the patchset
in 0/7. Otherwise

Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>

cheers,
jamal

Al Viro Sept. 7, 2018, 2:35 a.m. UTC | #2

On Thu, Sep 06, 2018 at 06:21:09AM -0400, Jamal Hadi Salim wrote:

> For networking patches, subject should be reflective of tree and
> subsystem. Example for this one:
> "[PATCH net 1/7]:net: sched: cls_u32: fix hnode refcounting"
> Also useful to have a cover letter summarizing the patchset
> in 0/7. Otherwise
> 
> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>

Argh...  Unfortunately, there's this: in u32_delete() we have
        if (root_ht) {
                if (root_ht->refcnt > 1) {
                        *last = false;
                        goto ret;
                }
                if (root_ht->refcnt == 1) {
                        if (!ht_empty(root_ht)) {
                                *last = false;
                                goto ret;
                        }
                }
        }
and that would need to be updated.  However, that logics is bloody odd
to start with.  First of all, root_ht has come from
       struct tc_u_hnode *root_ht = rtnl_dereference(tp->root);
and the only place where it's ever modified is
        rcu_assign_pointer(tp->root, root_ht);
in u32_init(), where we'd bloody well checked that root_ht is non-NULL
(see
        if (root_ht == NULL)
                return -ENOBUFS;
upstream of that place) and where that assignment is inevitable on the
way to returning 0.  No matter what, if tp has passed u32_init() it
will have non-NULL ->root, forever.  And there is no way for tcf_proto
to be seen outside of tcf_proto_create() without ->init() having returned
0 - it gets freed before anyone sees it.

So this 'if (root_ht)' can't be false.  What's more, what the hell is the
whole thing checking?  We are in u32_delete().  It's called (as ->delete())
from tfilter_del_notify(), which is called from tc_del_tfilter().  If we
return 0 with *last true, we follow up calling tcf_proto_destroy().
OK, let's look at the logics in there:
	* if there are links to root hnode => false
	* if there's no links to root hnode and it has knodes => false
(BTW, if we ever get there with root_ht->refcnt < 1, we are obviously screwed)
	* if there is a tcf_proto sharing tp->data => false (i.e. any filters
with different prio - don't bother)
	* if tp is the only one with reference to tp->data and there are *any*
knodes => false.

Any extra links can come only from knodes in a non-empty hnode.  And it's not
a common case.  Shouldn't thIe whole thing be
	* shared tp->data => false
	* any non-empty hnode => false
instead?  Perhaps even with the knode counter in tp->data, avoiding any loops
in there, as well as the entire ht_empty()...

Now, in the very beginning of u32_delete() we have this:
        struct tc_u_hnode *ht = arg;
	
        if (ht == NULL)
                goto out;
OK, but the call of ->delete() is
        err = tp->ops->delete(tp, fh, last, extack);
and arg == NULL seen in u32_delete() means fh == NULL in tfilter_del_notify().
Which is called in
        if (!fh) {
		...
	} else {
                bool last;

                err = tfilter_del_notify(net, skb, n, tp, block,
                                         q, parent, fh, false, &last,
                                         extack);
How can we ever get there with NULL fh?

The whole thing makes very little sense; looks like it used to live in
u32_destroy() prior to commit 763dbf6328e41 ("net_sched: move the empty tp
check from ->destroy() to ->delete()"), but looking at the rationale in
that commit...  I don't see how it fixes anything - sure, now we remove
tcf_proto from the list before calling ->destroy().  Without any RCU delays
in between.  How could it possibly solve any issues with ->classify()
called in parallel with ->destroy()?  cls_u32 (at least these days)
does try to survive u32_destroy() in parallel with u32_classify();
if any other classifiers do not, they are still broken and that commit
has not done anything for them.

Anyway, adjusting 1/7 for that is trivial, but I would really like to
understand what that code is doing...  Comments?

Jamal Hadi Salim Sept. 7, 2018, 12:13 p.m. UTC | #3

On 2018-09-06 10:35 p.m., Al Viro wrote:
> On Thu, Sep 06, 2018 at 06:21:09AM -0400, Jamal Hadi Salim wrote:
[..]
> 
> Argh...  Unfortunately, there's this: in u32_delete() we have
>          if (root_ht) {
>                  if (root_ht->refcnt > 1) {
>                          *last = false;
>                          goto ret;
>                  }
>                  if (root_ht->refcnt == 1) {
>                          if (!ht_empty(root_ht)) {
>                                  *last = false;
>                                  goto ret;
>                          }
>                  }
>          }
> and that would need to be updated.  

It is not detrimental as you have it right now but
you are right an adjustment is needed...

Deleting of a root directly should not be allowed. But
you can flush a whole tp. Consider this:
--
sudo tc qdisc add dev $P ingress
sudo tc filter add dev $P parent ffff: protocol ip prio 10 \
u32 match ip protocol 1 0xff

Which creates root ht 800

You shouldnt be allowed to do this:
--
tc filter delete dev $P parent ffff: protocol ip prio 10 handle 800: u32
---

But you can delete the tp entirely as such:
---
tc filter delete dev $P parent ffff: protocol ip prio 10 u32
--

The later will go via the destroy() path and flush all filters.

You should also be able to delete individual filters. ex:
$tc filter del dev $P parent ffff: prio 10 handle 800:0:800 u32

Where that code you are referring to is important is when
the last filter deleted - we need the caller to know
and it destroys root.

i.e you should return last=true when the last filter is
deleted so root gets auto deleted (just like it was autocreated)

> However, that logics is bloody odd
> to start with.  First of all, root_ht has come from
>         struct tc_u_hnode *root_ht = rtnl_dereference(tp->root);
> and the only place where it's ever modified is
>          rcu_assign_pointer(tp->root, root_ht);
> in u32_init(), where we'd bloody well checked that root_ht is non-NULL
> (see
>          if (root_ht == NULL)
>                  return -ENOBUFS;
> upstream of that place) and where that assignment is inevitable on the
> way to returning 0.  No matter what, if tp has passed u32_init() it
> will have non-NULL ->root, forever.  And there is no way for tcf_proto
> to be seen outside of tcf_proto_create() without ->init() having returned
> 0 - it gets freed before anyone sees it.
> 

Yes, the check for root_ht is not necessary - but the check for the
last filter (and testing for last) is needed.

> So this 'if (root_ht)' can't be false.  What's more, what the hell is the
> whole thing checking?  We are in u32_delete().  It's called (as ->delete())
> from tfilter_del_notify(), which is called from tc_del_tfilter().  If we
> return 0 with *last true, we follow up calling tcf_proto_destroy().
> OK, let's look at the logics in there:
> 	* if there are links to root hnode => false
> 	* if there's no links to root hnode and it has knodes => false
> (BTW, if we ever get there with root_ht->refcnt < 1, we are obviously screwed)
> 	* if there is a tcf_proto sharing tp->data => false (i.e. any filters
> with different prio - don't bother)
> 	* if tp is the only one with reference to tp->data and there are *any*
> knodes => false.
> 
> Any extra links can come only from knodes in a non-empty hnode.  And it's not
> a common case.  Shouldn't thIe whole thing be
> 	* shared tp->data => false
> 	* any non-empty hnode => false
> instead?  Perhaps even with the knode counter in tp->data, avoiding any loops
> in there, as well as the entire ht_empty()...
> 
> Now, in the very beginning of u32_delete() we have this:
>          struct tc_u_hnode *ht = arg;
> 	
>          if (ht == NULL)
>                  goto out;
> OK, but the call of ->delete() is
>          err = tp->ops->delete(tp, fh, last, extack);
> and arg == NULL seen in u32_delete() means fh == NULL in tfilter_del_notify().
> Which is called in
>          if (!fh) {
> 		...
> 	} else {
>                  bool last;
> 
>                  err = tfilter_del_notify(net, skb, n, tp, block,
>                                           q, parent, fh, false, &last,
>                                           extack);
> How can we ever get there with NULL fh?
>

Try:
tc filter delete dev $P parent ffff: protocol ip prio 10 u32
tcm handle is 0, so will hit that code path.

> The whole thing makes very little sense; looks like it used to live in
> u32_destroy() prior to commit 763dbf6328e41 ("net_sched: move the empty tp
> check from ->destroy() to ->delete()"), but looking at the rationale in
> that commit...  I don't see how it fixes anything - sure, now we remove
> tcf_proto from the list before calling ->destroy().  Without any RCU delays
> in between.  How could it possibly solve any issues with ->classify()
> called in parallel with ->destroy()?  cls_u32 (at least these days)
> does try to survive u32_destroy() in parallel with u32_classify();
> if any other classifiers do not, they are still broken and that commit
> has not done anything for them.
> 
> Anyway, adjusting 1/7 for that is trivial, but I would really like to
> understand what that code is doing...  Comments?
> 

Refer to above.

cheers,
jamal

Jamal Hadi Salim Sept. 7, 2018, 12:33 p.m. UTC | #4

To clarify with an example i used to test
your patches:

#0 add ingress filter
$TC qdisc add dev $P ingress
#1 add filter
$TC filter add dev $P parent ffff: protocol ip prio 10 \
u32 match ip protocol 1 0xff
#2 display
$TC filter ls dev $P parent ffff:

#3 try to delete root
$TC filter delete dev $P parent ffff: protocol ip prio 10 \
handle 800: u32

#4 nothing changes..
$TC filter ls dev $P parent ffff:
#5 delete filter
$TC filter delete dev $P parent ffff: protocol ip prio 10 \
handle 800:0:800 u32
#6 filter gone but hash table still there..
$TC filter ls dev $P parent ffff:
#7 delete tp
$TC filter delete dev $P parent ffff: protocol ip prio 10 \
u32
#8 now it is gone..
$TC filter ls dev $P parent ffff:

your patches show #6 filter as still active.
We want it to look like #8

Hope this helps.

cheers,
jamal

Al Viro Sept. 8, 2018, 3:03 p.m. UTC | #5

On Fri, Sep 07, 2018 at 08:13:56AM -0400, Jamal Hadi Salim wrote:
> > 	} else {
> >                  bool last;
> > 
> >                  err = tfilter_del_notify(net, skb, n, tp, block,
> >                                           q, parent, fh, false, &last,
> >                                           extack);
> > How can we ever get there with NULL fh?
> > 
> 
> Try:
> tc filter delete dev $P parent ffff: protocol ip prio 10 u32
> tcm handle is 0, so will hit that code path.

Huh?  It will hit tcf_proto_destroy() (and thus u32_destroy()), but where will
it hit u32_delete()?  Sure, we have fh == NULL there; what happens next is
                if (t->tcm_handle == 0) {
                        tcf_chain_tp_remove(chain, &chain_info, tp);    
                        tfilter_notify(net, skb, n, tp, block, q, parent, fh,
                                       RTM_DELTFILTER, false);
                        tcf_proto_destroy(tp, extack);
and that's it.  IDGI...  Direct experiment shows that on e.g.
tc qdisc add dev eth0 ingress
tc filter add dev eth0 parent ffff: protocol ip prio 10 u32 match ip protocol 1 0xff
tc filter delete dev eth0 parent ffff: protocol ip prio 10 u32
we get u32_destroy() called, with u32_destroy_hnode() called by it,
but no u32_delete() is called at all, let alone with ht == NULL...

[1/7] fix hnode refcounting

Commit Message

Comments

Patch