mbox series

[net,00/13] cls_u32 cleanups and fixes.

Message ID 20180909013132.3222-1-viro@ZenIV.linux.org.uk
Headers show
Series cls_u32 cleanups and fixes. | expand

Message

Al Viro Sept. 9, 2018, 1:31 a.m. UTC
From: Al Viro <viro@zeniv.linux.org.uk>

A series of net/sched/cls_u32.c cleanups and fixes:
	1) fix hnode refcounting.  Refcounting for tc_u_hnode is broken;
it's not hard to trigger oopsen (including one inside an interrupt handler,
with resulting panic) as well as memory corruption.  Definitely -stable
fodder.
	2) mark root hnode explicitly.  Consistent errors on attempts to
delete root hnodes.  Less serious than 1/13.
	3) disallow linking to root hnode.  Prohibit creating links to
root hnodes; not critical (nothing actually breaks if we do allow those),
but gets rid of surprising cases.
	4) make sure that divisor is a power of 2.  Missing validation -
divisor is documented as power of 2, but that's not actually enforced.
Results are moderately bogus (i.e. the kernel doesn't break), but rather
surprising.

Those are fixes, or at least can be argued to be such.  The rest are pure
cleanups:
	5) get rid of unused argument of u32_destroy_key()
	6) get rid of tc_u_knode ->tp
	7) get rid of tc_u_common ->rcu
Eliminate some unused fields.
	8) clean tc_u_common hashtable.
Hash lookups are best done with minimum of calculations per chain
element - comparing the field in each candidate with f(key) where
f() is actually a pure function is not nice, especially when
compiler doesn't see f() as such...  Better calculate f(key) once,
especially since we need its value to choose the hash chain in
the first place.
	9) pass tc_u_common to u32_set_parms() instead of tc_u_hnode
	10) the tp_c argument of u32_set_parms() is always tp->data
	11) get rid of hnode ->tp_c and tp_c argument of u32_set_parms()
Massage that ends with getting rid of a redundant field.
	12) keep track of knodes count in tc_u_common
	13) simplify the hell out u32_delete() emptiness check
Checking if a filter needs to be killed after u32_delete() can be
done much easier - the test is equivalent to "filter doesn't
have ->data shared with anyone else and it has no knodes left
in it" and keeping track of the number of knodes is trivial.

Comments

Jamal Hadi Salim Sept. 9, 2018, 12:58 p.m. UTC | #1
Since you have the momentum here: i noticed something
unusual while i was trying to craft a test that would
vet some of your changes. This has nothing to do with
your changes, same happens on my stock debian laptop
with kernel:
4.17.0-0.bpo.3-amd64 #1 SMP Debian 4.17.17-1~bpo9+1 (2018-08-27)

Looking at git - possibly introduced around the time u32
lockless was being introduced and maybe even earlier
than that. Unfortunately i dont have time to dig
further.

To reproduce what i am referring to, here's a setup:

$tc filter add dev eth0 parent ffff: protocol ip prio 102 u32 \
classid 1:2 match ip src 192.168.8.0/8
$tc filter replace dev eth0 parent ffff: protocol ip prio 102 \
handle 800:0:800 u32 classid 1:2 match ip src 1.1.0.0/24

u32_change() code path should have allowed changing of the
keynode.

cheers,
jamal
Al Viro Sept. 9, 2018, 2:15 p.m. UTC | #2
On Sun, Sep 09, 2018 at 08:58:50AM -0400, Jamal Hadi Salim wrote:
> 
> Since you have the momentum here: i noticed something
> unusual while i was trying to craft a test that would
> vet some of your changes. This has nothing to do with
> your changes, same happens on my stock debian laptop
> with kernel:
> 4.17.0-0.bpo.3-amd64 #1 SMP Debian 4.17.17-1~bpo9+1 (2018-08-27)
> 
> Looking at git - possibly introduced around the time u32
> lockless was being introduced and maybe even earlier
> than that.

It's always been that way, actually - before that point the old
knode simply got reused, which excluded any chance of changing
n->sel.

> Unfortunately i dont have time to dig
> further.
> 
> To reproduce what i am referring to, here's a setup:
> 
> $tc filter add dev eth0 parent ffff: protocol ip prio 102 u32 \
> classid 1:2 match ip src 192.168.8.0/8
> $tc filter replace dev eth0 parent ffff: protocol ip prio 102 \
> handle 800:0:800 u32 classid 1:2 match ip src 1.1.0.0/24
> 
> u32_change() code path should have allowed changing of the
> keynode.

Umm...  Interesting - TCA_U32_SEL is not the only thing that
gets ignored there; TCA_U32_MARK gets the same treatment.
And then there's a lovely question what to do with n->pf -
it's an array of n->sel.nkeys counters, and apparently we
want (at least in common cases) to avoid resetting those.

*If* we declare that ->nkeys mismatch means failure, it's
all relatively easy to implement.  Alternatively, we could
declare that selector change means resetting the stats.
Preferences?
Al Viro Sept. 9, 2018, 3:48 p.m. UTC | #3
On Sun, Sep 09, 2018 at 03:15:38PM +0100, Al Viro wrote:

> Umm...  Interesting - TCA_U32_SEL is not the only thing that
> gets ignored there; TCA_U32_MARK gets the same treatment.
> And then there's a lovely question what to do with n->pf -
> it's an array of n->sel.nkeys counters, and apparently we
> want (at least in common cases) to avoid resetting those.
> 
> *If* we declare that ->nkeys mismatch means failure, it's
> all relatively easy to implement.  Alternatively, we could
> declare that selector change means resetting the stats.
> Preferences?

BTW, shouldn't we issue u32_clear_hw_hnode() every time
we destroy an hnode?  It's done on u32_delete(), it's
done (for root ht) on u32_destroy(), but it's not done
for any other hnodes when you remove the entire (not shared)
filter.  Looks fishy...
Jamal Hadi Salim Sept. 10, 2018, 11:30 a.m. UTC | #4
On 2018-09-09 10:15 a.m., Al Viro wrote:

[..]

> Umm...  Interesting - TCA_U32_SEL is not the only thing that
> gets ignored there; TCA_U32_MARK gets the same treatment.
> And then there's a lovely question what to do with n->pf -
> it's an array of n->sel.nkeys counters, and apparently we
> want (at least in common cases) to avoid resetting those.

> *If* we declare that ->nkeys mismatch means failure, it's
> all relatively easy to implement.  Alternatively, we could
> declare that selector change means resetting the stats.
> Preferences?

I am now conflicted. I have sample scripts which showed
that at one point that worked. All of them seem to be
indicating the nkeys and stats never changed. So that
could be a starting point; however, if a policy changes
the match tuples then it is no longer the same rule imo.

Note: you can change the "actions" - of which the most
primitive is "set classid x:y" without changing what is
being matched. i.e changing the classid in that example
would work.

cheers,
jamal
Jamal Hadi Salim Sept. 10, 2018, 12:25 p.m. UTC | #5
On 2018-09-09 11:48 a.m., Al Viro wrote:

>
> BTW, shouldn't we issue u32_clear_hw_hnode() every time
> we destroy an hnode?  It's done on u32_delete(), it's
> done (for root ht) on u32_destroy(), but it's not done
> for any other hnodes when you remove the entire (not shared)
> filter.  Looks fishy...
> 

What you are saying makes sense, but that doesnt seem
like a new thing.
All hardware offload examples I have seen use root tables
only[1]. So I am not sure if they use any other tables
although the intel hardware at least seems very capable.
Look at ixgbe_main.c for example. Theres an explicit assumption
that root is always 0x800 but oresence of

+Cc some of the NIC vendor folks..



cheers,
jamal

[1]
Here's a script posted by someone at Intel(Sridhar?) a while back
that adds 2 filters, one with skip-sw and the other with skip-hw
flag.

---
    # add ingress qdisc
    tc qdisc add dev p4p1 ingress

    # enable hw tc offload.
    ethtool -K p4p1 hw-tc-offload on

    # add u32 filter with skip-sw flag.
    tc filter add dev p4p1 parent ffff: protocol ip prio 99 \
       handle 800:0:1 u32 ht 800: flowid 800:1 \
       skip-sw \
       match ip src 192.168.1.0/24 \
       action drop

    # add u32 filter with skip-hw flag.
    tc filter add dev p4p1 parent ffff: protocol ip prio 99 \
       handle 800:0:2 u32 ht 800: flowid 800:2 \
       skip-hw \
       match ip src 192.168.2.0/24 \
       action drop
----
Jamal Hadi Salim Sept. 10, 2018, 12:31 p.m. UTC | #6
On 2018-09-10 8:25 a.m., Jamal Hadi Salim wrote:
> On 2018-09-09 11:48 a.m., Al Viro wrote:
> 
>>
>> BTW, shouldn't we issue u32_clear_hw_hnode() every time
>> we destroy an hnode?  It's done on u32_delete(), it's
>> done (for root ht) on u32_destroy(), but it's not done
>> for any other hnodes when you remove the entire (not shared)
>> filter.  Looks fishy...
>>
> 
> What you are saying makes sense, but that doesnt seem
> like a new thing.
> All hardware offload examples I have seen use root tables
> only[1]. So I am not sure if they use any other tables
> although the intel hardware at least seems very capable.
> Look at ixgbe_main.c for example. Theres an explicit assumption
> that root is always 0x800 but oresence of
>

"..but presence of other tables can be handled"

cheers,
jamal
David Miller Sept. 12, 2018, 6:15 a.m. UTC | #7
From: Al Viro <viro@ZenIV.linux.org.uk>
Date: Sun,  9 Sep 2018 02:31:19 +0100

> A series of net/sched/cls_u32.c cleanups and fixes:

Al, let's separate the serious bug fixes into a separate series
targetting net.  Then we can do all of the cleanups and
simplifications in net-next once I merge net into there.

Thank you.