
[net-next,00/13] Control action percpu counters allocation by netlink flag

Message ID 20191022141804.27639-1-vladbu@mellanox.com

Message

Vlad Buslov Oct. 22, 2019, 2:17 p.m. UTC
Currently, a significant fraction of CPU time during TC filter allocation
is spent in the percpu allocator. Moreover, the percpu allocator is
protected by a single global mutex, which negates any potential to
improve its performance through the recent developments in the TC filter
update API that removed the rtnl lock for some Qdiscs and classifiers.
In order to significantly improve the filter update rate and reduce
memory usage, we would like to allow users to skip percpu counter
allocation for a specific action if they don't expect a high traffic
rate to hit that action, which is a reasonable expectation for a
hardware-offloaded setup. In that case any potential gain in software
fast-path performance from using percpu-allocated counters instead of
regular integer counters protected by a spinlock is unimportant, but the
additional CPU time and memory they consume is significant.

In order to allow configuring action counters allocation type at
runtime, implement following changes:

- Implement helper functions to update the action counters and use them
  in the affected actions instead of updating counters directly. This
  step abstracts the action implementations from the counter type used
  by a particular action instance at runtime.

- Modify the new helpers to use percpu counters if they were allocated
  during action initialization and use regular counters otherwise.

- Extend the actions that are used for hardware offloads with an
  optional 32-bit netlink flags field. Add the TCA_ACT_FLAGS_FAST_INIT
  action flag and update the affected actions to not allocate percpu
  counters when the flag is set.

With these changes, users that prefer action update (slow-path) speed
over software fast-path speed can dynamically request that actions skip
percpu counter allocation, without affecting other users.

Now, let's look at the actual performance gains provided by this change.
A simple test is used to measure the insertion rate: iproute2 tc is
executed in parallel by xargs in batch mode, and its total execution
time is measured with the shell builtin "time" command. The command runs
20 concurrent tc instances, each with its own batch file of 100k rules:

$ time ls add* | xargs -n 1 -P 20 sudo tc -b

Two main rule profiles are tested. The first is a simple L2 flower
classifier with a single gact drop action. This configuration is chosen
as the worst-case scenario because single-action rules minimize the
pressure on the percpu allocator. Example rule:

filter add dev ens1f0 protocol ip ingress prio 1 handle 1 flower skip_hw
    src_mac e4:11:0:0:0:0 dst_mac e4:12:0:0:0:0 action drop
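
For reference, batch files like the ones used above can be generated
with a short script along the following lines. The file names, the
scaled-down counts, and the way the handle and dst_mac are varied are
illustrative (the actual benchmark used 20 files of 100k rules each):

```shell
#!/bin/sh
# Generate NFILES tc batch files (add0, add1, ...) with NRULES unique
# flower drop rules each.  Counts are reduced for illustration.
NFILES=2          # the cover letter benchmark used 20
NRULES=100        # ... and 100000 rules per file
DEV=ens1f0

for f in $(seq 0 $((NFILES - 1))); do
    : > "add$f"
    for i in $(seq 1 "$NRULES"); do
        # Unique handle per prio; vary dst_mac so flower keys differ.
        h=$((f * NRULES + i))
        printf 'filter add dev %s protocol ip ingress prio 1 handle %d flower skip_hw src_mac e4:11:0:0:0:0 dst_mac e4:12:0:%x:%x:%x action drop\n' \
            "$DEV" "$h" "$f" $((i / 256)) $((i % 256)) >> "add$f"
    done
done
echo "generated $NFILES batch files of $NRULES rules"
```

Each generated file can then be fed to its own `tc -b` instance via
xargs, as in the `time` command shown earlier.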

The second profile is a typical real-world scenario that uses the flower
classifier with some L2-L4 fields and two actions (tunnel_key+mirred).
Example rule:

filter add dev ens1f0_0 protocol ip ingress prio 1 handle 1 flower
    skip_hw src_mac e4:11:0:0:0:0 dst_mac e4:12:0:0:0:0 src_ip
    192.168.111.1 dst_ip 192.168.111.2 ip_proto udp dst_port 1 src_port
    1 action tunnel_key set id 1 src_ip 2.2.2.2 dst_ip 2.2.2.3 dst_port
    4789 action mirred egress redirect dev vxlan1

 Profile           |        percpu |     fast_init | X improvement 
                   | (k rules/sec) | (k rules/sec) |               
-------------------+---------------+---------------+---------------
 Gact drop         |           203 |           259 |          1.28 
 tunnel_key+mirred |            92 |           204 |          2.22 

For the simple drop action, removing percpu allocation leads to a ~28%
insertion rate improvement. The perf profiles highlight the bottlenecks.

Perf profile of run with percpu allocation (gact drop):

+ 89.11% 0.48% tc [kernel.vmlinux] [k] entry_SYSCALL_64
+ 88.58% 0.04% tc [kernel.vmlinux] [k] do_syscall_64
+ 87.50% 0.04% tc libc-2.29.so [.] __libc_sendmsg
+ 86.96% 0.04% tc [kernel.vmlinux] [k] __sys_sendmsg
+ 86.85% 0.01% tc [kernel.vmlinux] [k] ___sys_sendmsg
+ 86.60% 0.05% tc [kernel.vmlinux] [k] sock_sendmsg
+ 86.55% 0.12% tc [kernel.vmlinux] [k] netlink_sendmsg
+ 86.04% 0.13% tc [kernel.vmlinux] [k] netlink_unicast
+ 85.42% 0.03% tc [kernel.vmlinux] [k] netlink_rcv_skb
+ 84.68% 0.04% tc [kernel.vmlinux] [k] rtnetlink_rcv_msg
+ 84.56% 0.24% tc [kernel.vmlinux] [k] tc_new_tfilter
+ 75.73% 0.65% tc [cls_flower] [k] fl_change
+ 71.30% 0.03% tc [kernel.vmlinux] [k] tcf_exts_validate
+ 71.27% 0.13% tc [kernel.vmlinux] [k] tcf_action_init
+ 71.06% 0.01% tc [kernel.vmlinux] [k] tcf_action_init_1
+ 70.41% 0.04% tc [act_gact] [k] tcf_gact_init
+ 53.59% 1.21% tc [kernel.vmlinux] [k] __mutex_lock.isra.0
+ 52.34% 0.34% tc [kernel.vmlinux] [k] tcf_idr_create
- 51.23% 2.17% tc [kernel.vmlinux] [k] pcpu_alloc
  - 49.05% pcpu_alloc
    + 39.35% __mutex_lock.isra.0
      4.99% memset_erms
    + 2.16% pcpu_alloc_area
  + 2.17% __libc_sendmsg
+ 45.89% 44.33% tc [kernel.vmlinux] [k] osq_lock
+ 9.94% 0.04% tc [kernel.vmlinux] [k] tcf_idr_check_alloc
+ 7.76% 0.00% tc [kernel.vmlinux] [k] tcf_idr_insert
+ 6.50% 0.03% tc [kernel.vmlinux] [k] tfilter_notify
+ 6.24% 6.11% tc [kernel.vmlinux] [k] mutex_spin_on_owner
+ 5.73% 5.32% tc [kernel.vmlinux] [k] memset_erms
+ 5.31% 0.18% tc [kernel.vmlinux] [k] tcf_fill_node

Here the bottleneck is clearly the pcpu_alloc() function, which takes
more than half of the CPU time, most of it wasted busy-waiting on the
percpu allocator's internal global lock.

With percpu allocation removed (gact drop):

+ 87.50% 0.51% tc [kernel.vmlinux] [k] entry_SYSCALL_64
+ 86.94% 0.07% tc [kernel.vmlinux] [k] do_syscall_64
+ 85.75% 0.04% tc libc-2.29.so [.] __libc_sendmsg
+ 85.00% 0.07% tc [kernel.vmlinux] [k] __sys_sendmsg
+ 84.84% 0.07% tc [kernel.vmlinux] [k] ___sys_sendmsg
+ 84.59% 0.01% tc [kernel.vmlinux] [k] sock_sendmsg
+ 84.58% 0.14% tc [kernel.vmlinux] [k] netlink_sendmsg
+ 83.95% 0.12% tc [kernel.vmlinux] [k] netlink_unicast
+ 83.34% 0.01% tc [kernel.vmlinux] [k] netlink_rcv_skb
+ 82.39% 0.12% tc [kernel.vmlinux] [k] rtnetlink_rcv_msg
+ 82.16% 0.25% tc [kernel.vmlinux] [k] tc_new_tfilter
+ 75.13% 0.84% tc [cls_flower] [k] fl_change
+ 69.92% 0.05% tc [kernel.vmlinux] [k] tcf_exts_validate
+ 69.87% 0.11% tc [kernel.vmlinux] [k] tcf_action_init
+ 69.61% 0.02% tc [kernel.vmlinux] [k] tcf_action_init_1
- 68.80% 0.10% tc [act_gact] [k] tcf_gact_init
  - 68.70% tcf_gact_init
    + 36.08% tcf_idr_check_alloc
    + 31.88% tcf_idr_insert
+ 63.72% 0.58% tc [kernel.vmlinux] [k] __mutex_lock.isra.0
+ 58.80% 56.68% tc [kernel.vmlinux] [k] osq_lock
+ 36.08% 0.04% tc [kernel.vmlinux] [k] tcf_idr_check_alloc
+ 31.88% 0.01% tc [kernel.vmlinux] [k] tcf_idr_insert

The gact actions (like all other action types) are inserted into a
single idr instance protected by a global (per-namespace) lock, which
becomes the new bottleneck with such a simple rule profile and prevents
the 2x+ performance increase that the profiling data for insertion with
percpu counters would suggest.

Perf profile of run with percpu allocation (tunnel_key+mirred):

+ 91.95% 0.21% tc [kernel.vmlinux] [k] entry_SYSCALL_64
+ 91.74% 0.06% tc [kernel.vmlinux] [k] do_syscall_64
+ 90.74% 0.01% tc libc-2.29.so [.] __libc_sendmsg
+ 90.52% 0.01% tc [kernel.vmlinux] [k] __sys_sendmsg
+ 90.50% 0.04% tc [kernel.vmlinux] [k] ___sys_sendmsg
+ 90.41% 0.02% tc [kernel.vmlinux] [k] sock_sendmsg
+ 90.38% 0.04% tc [kernel.vmlinux] [k] netlink_sendmsg
+ 90.10% 0.06% tc [kernel.vmlinux] [k] netlink_unicast
+ 89.76% 0.01% tc [kernel.vmlinux] [k] netlink_rcv_skb
+ 89.28% 0.04% tc [kernel.vmlinux] [k] rtnetlink_rcv_msg
+ 89.15% 0.03% tc [kernel.vmlinux] [k] tc_new_tfilter
+ 83.41% 0.33% tc [cls_flower] [k] fl_change
+ 81.17% 0.04% tc [kernel.vmlinux] [k] tcf_exts_validate
+ 81.13% 0.06% tc [kernel.vmlinux] [k] tcf_action_init
+ 81.04% 0.04% tc [kernel.vmlinux] [k] tcf_action_init_1
- 73.59% 2.16% tc [kernel.vmlinux] [k] pcpu_alloc
  - 71.42% pcpu_alloc
    + 61.41% __mutex_lock.isra.0
      5.02% memset_erms
    + 2.93% pcpu_alloc_area
  + 2.16% __libc_sendmsg
+ 63.58% 0.17% tc [kernel.vmlinux] [k] tcf_idr_create
+ 63.40% 0.60% tc [kernel.vmlinux] [k] __mutex_lock.isra.0
+ 57.85% 56.38% tc [kernel.vmlinux] [k] osq_lock
+ 46.27% 0.13% tc [act_tunnel_key] [k] tunnel_key_init
+ 34.26% 0.02% tc [act_mirred] [k] tcf_mirred_init
+ 10.99% 0.00% tc [kernel.vmlinux] [k] dst_cache_init
+ 5.32% 5.11% tc [kernel.vmlinux] [k] memset_erms

With twice as many actions, the pressure on the percpu allocator
doubles, so it now takes ~74% of CPU execution time.

With percpu allocation removed (tunnel_key+mirred):

+ 86.02% 0.50% tc [kernel.vmlinux] [k] entry_SYSCALL_64
+ 85.51% 0.12% tc [kernel.vmlinux] [k] do_syscall_64
+ 84.40% 0.03% tc libc-2.29.so [.] __libc_sendmsg
+ 83.84% 0.03% tc [kernel.vmlinux] [k] __sys_sendmsg
+ 83.72% 0.01% tc [kernel.vmlinux] [k] ___sys_sendmsg
+ 83.56% 0.01% tc [kernel.vmlinux] [k] sock_sendmsg
+ 83.50% 0.08% tc [kernel.vmlinux] [k] netlink_sendmsg
+ 83.02% 0.17% tc [kernel.vmlinux] [k] netlink_unicast
+ 82.48% 0.00% tc [kernel.vmlinux] [k] netlink_rcv_skb
+ 81.89% 0.11% tc [kernel.vmlinux] [k] rtnetlink_rcv_msg
+ 81.71% 0.25% tc [kernel.vmlinux] [k] tc_new_tfilter
+ 73.99% 0.63% tc [cls_flower] [k] fl_change
+ 69.72% 0.00% tc [kernel.vmlinux] [k] tcf_exts_validate
+ 69.72% 0.09% tc [kernel.vmlinux] [k] tcf_action_init
+ 69.53% 0.05% tc [kernel.vmlinux] [k] tcf_action_init_1
+ 53.08% 0.91% tc [kernel.vmlinux] [k] __mutex_lock.isra.0
+ 45.52% 43.99% tc [kernel.vmlinux] [k] osq_lock
- 36.02% 0.21% tc [act_tunnel_key] [k] tunnel_key_init
  - 35.81% tunnel_key_init
    + 15.95% tcf_idr_check_alloc
    + 13.91% tcf_idr_insert
    - 4.70% dst_cache_init
      + 4.68% pcpu_alloc
+ 33.22% 0.04% tc [kernel.vmlinux] [k] tcf_idr_check_alloc
+ 32.34% 0.05% tc [act_mirred] [k] tcf_mirred_init
+ 28.24% 0.01% tc [kernel.vmlinux] [k] tcf_idr_insert
+ 7.79% 0.05% tc [kernel.vmlinux] [k] idr_alloc_u32
+ 7.67% 7.35% tc [kernel.vmlinux] [k] idr_get_free
+ 6.46% 6.22% tc [kernel.vmlinux] [k] mutex_spin_on_owner
+ 5.11% 0.05% tc [kernel.vmlinux] [k] tfilter_notify

With percpu allocation removed, the insertion rate is increased by
~120%. This rule profile scales much better than the simple
single-action one because both action types were competing for the
single lock in the percpu allocator, but not for the action idr lock,
which is per action type. Note that the percpu allocator is still used
by dst_cache in tunnel_key actions and consumes 4.68% of CPU time.
dst_cache looks like a good opportunity for further insertion rate
optimization, but it is not addressed by this change.

Setup details: 2x Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz, 32GB memory

Patches applied on top of net-next branch:

commit 2203cbf2c8b58a1e3bef98c47531d431d11639a0 (net-next)
Author: Russell King <rmk+kernel@armlinux.org.uk>
Date:   Tue Oct 15 11:38:39 2019 +0100

net: sfp: move fwnode parsing into sfp-bus layer

Vlad Buslov (13):
  net: sched: extract common action counters update code into function
  net: sched: extract bstats update code into function
  net: sched: extract qstats update code into functions
  net: sched: don't expose action qstats to skb_tc_reinsert()
  net: sched: modify stats helper functions to support regular stats
  net: sched: add action fast initialization flag
  net: sched: act_gact: support fast init flag to skip pcpu allocation
  net: sched: act_csum: support fast init flag to skip pcpu allocation
  net: sched: act_mirred: support fast init flag to skip pcpu alloc
  net: sched: tunnel_key: support fast init flag to skip pcpu alloc
  net: sched: act_vlan: support fast init flag to skip pcpu allocation
  net: sched: act_ct: support fast init flag to skip pcpu allocation
  tc-testing: implement tests for new fast_init action flag

 include/net/act_api.h                         | 47 ++++++++++++++++++-
 include/net/pkt_cls.h                         |  3 ++
 include/net/sch_generic.h                     | 12 +----
 include/uapi/linux/pkt_cls.h                  |  8 ++++
 include/uapi/linux/tc_act/tc_csum.h           |  1 +
 include/uapi/linux/tc_act/tc_ct.h             |  1 +
 include/uapi/linux/tc_act/tc_gact.h           |  1 +
 include/uapi/linux/tc_act/tc_mirred.h         |  1 +
 include/uapi/linux/tc_act/tc_tunnel_key.h     |  1 +
 include/uapi/linux/tc_act/tc_vlan.h           |  1 +
 net/sched/act_api.c                           | 28 ++++++++++-
 net/sched/act_bpf.c                           |  2 +-
 net/sched/act_connmark.c                      |  3 +-
 net/sched/act_csum.c                          | 22 +++++++--
 net/sched/act_ct.c                            | 27 +++++++----
 net/sched/act_ctinfo.c                        |  3 +-
 net/sched/act_gact.c                          | 32 ++++++++-----
 net/sched/act_ife.c                           |  2 +-
 net/sched/act_ipt.c                           |  8 ++--
 net/sched/act_mirred.c                        | 30 ++++++++----
 net/sched/act_mpls.c                          |  2 +-
 net/sched/act_nat.c                           |  3 +-
 net/sched/act_pedit.c                         |  3 +-
 net/sched/act_police.c                        |  7 +--
 net/sched/act_sample.c                        |  2 +-
 net/sched/act_simple.c                        |  3 +-
 net/sched/act_skbedit.c                       |  2 +-
 net/sched/act_skbmod.c                        |  2 +-
 net/sched/act_tunnel_key.c                    | 20 ++++++--
 net/sched/act_vlan.c                          | 26 ++++++----
 .../tc-testing/tc-tests/actions/csum.json     | 24 ++++++++++
 .../tc-testing/tc-tests/actions/ct.json       | 24 ++++++++++
 .../tc-testing/tc-tests/actions/gact.json     | 24 ++++++++++
 .../tc-testing/tc-tests/actions/mirred.json   | 24 ++++++++++
 .../tc-tests/actions/tunnel_key.json          | 24 ++++++++++
 .../tc-testing/tc-tests/actions/vlan.json     | 24 ++++++++++
 36 files changed, 367 insertions(+), 80 deletions(-)

Comments

Marcelo Leitner Oct. 22, 2019, 2:35 p.m. UTC | #1
On Tue, Oct 22, 2019 at 05:17:51PM +0300, Vlad Buslov wrote:
> Currently, significant fraction of CPU time during TC filter allocation
> is spent in percpu allocator. Moreover, percpu allocator is protected
> with single global mutex which negates any potential to improve its
> performance by means of recent developments in TC filter update API that
> removed rtnl lock for some Qdiscs and classifiers. In order to
> significantly improve filter update rate and reduce memory usage we
> would like to allow users to skip percpu counters allocation for
> specific action if they don't expect high traffic rate hitting the
> action, which is a reasonable expectation for hardware-offloaded setup.
> In that case any potential gains to software fast-path performance
> gained by usage of percpu-allocated counters compared to regular integer
> counters protected by spinlock are not important, but amount of
> additional CPU and memory consumed by them is significant.

Yes!

I wonder how this plays together with conntrack offloading.  With it
the sw datapath will be used more, as a conntrack entry can only be
offloaded after the handshake.  That said, the host may have to process
quite a few handshakes in the sw datapath.  It seems OvS can then just
not set this flag in act_ct (and the other actions for such a rule),
and those cases will still be able to leverage the percpu stats.  Right?

> allocator, but not for action idr lock, which is per-action. Note that
> percpu allocator is still used by dst_cache in tunnel_key actions and
> consumes 4.68% CPU time. Dst_cache seems like good opportunity for
> further insertion rate optimization but is not addressed by this change.

I vented this idea re dst_cache last week with Paolo. He sent me a
draft patch but I didn't test it yet.

Thanks,
Marcelo
Vlad Buslov Oct. 22, 2019, 2:52 p.m. UTC | #2
On Tue 22 Oct 2019 at 17:35, Marcelo Ricardo Leitner <mleitner@redhat.com> wrote:
> On Tue, Oct 22, 2019 at 05:17:51PM +0300, Vlad Buslov wrote:
>> Currently, significant fraction of CPU time during TC filter allocation
>> is spent in percpu allocator. Moreover, percpu allocator is protected
>> with single global mutex which negates any potential to improve its
>> performance by means of recent developments in TC filter update API that
>> removed rtnl lock for some Qdiscs and classifiers. In order to
>> significantly improve filter update rate and reduce memory usage we
>> would like to allow users to skip percpu counters allocation for
>> specific action if they don't expect high traffic rate hitting the
>> action, which is a reasonable expectation for hardware-offloaded setup.
>> In that case any potential gains to software fast-path performance
>> gained by usage of percpu-allocated counters compared to regular integer
>> counters protected by spinlock are not important, but amount of
>> additional CPU and memory consumed by them is significant.
>
> Yes!
>
> I wonder how this can play together with conntrack offloading.  With
> it the sw datapath will be more used, as a conntrack entry can only be
> offloaded after the handshake.  That said, the host can have to
> process quite some handshakes in sw datapath.  Seems OvS can then just
> not set this flag in act_ct (and others for this rule), and such cases
> will be able to leverage the percpu stats.  Right?

The flag is set per action instance, so the client can choose not to
use the flag on a case-by-case basis. The conntrack use case requires
further investigation, since I'm not entirely convinced that handling
the first few packets in sw (before the connection reaches the
established state and is offloaded) warrants having percpu counters.

>
>> allocator, but not for action idr lock, which is per-action. Note that
>> percpu allocator is still used by dst_cache in tunnel_key actions and
>> consumes 4.68% CPU time. Dst_cache seems like good opportunity for
>> further insertion rate optimization but is not addressed by this change.
>
> I vented this idea re dst_cache last week with Paolo. He sent me a
> draft patch but I didn't test it yet.

Looking forward to it!

>
> Thanks,
> Marcelo
Marcelo Leitner Oct. 22, 2019, 3:15 p.m. UTC | #3
On Tue, Oct 22, 2019 at 05:17:51PM +0300, Vlad Buslov wrote:
> - Extend actions that are used for hardware offloads with optional
>   netlink 32bit flags field. Add TCA_ACT_FLAGS_FAST_INIT action flag and
>   update affected actions to not allocate percpu counters when the flag
>   is set.

I just went over all the patches and they mostly make sense to me. So
far the only point I'm uncertain of is the naming of the flag,
"fast_init".  The name is not clear about what the flag does, and it
can be overloaded with other stuff later, which we probably don't want.

Say, for example, we want percpu counters but to disable allocating
the stats for hw, to make the counter in 28169abadb08 ("net/sched: Add
hardware specific counters to TC actions") optional.

So what about:
TCA_ACT_FLAGS_NO_PERCPU_STATS
TCA_ACT_FLAGS_NO_HW_STATS (this one to be done on a subsequent patchset, yes)
?

  Marcelo
Vlad Buslov Oct. 22, 2019, 3:52 p.m. UTC | #4
On Tue 22 Oct 2019 at 18:15, Marcelo Ricardo Leitner <mleitner@redhat.com> wrote:
> On Tue, Oct 22, 2019 at 05:17:51PM +0300, Vlad Buslov wrote:
>> - Extend actions that are used for hardware offloads with optional
>>   netlink 32bit flags field. Add TCA_ACT_FLAGS_FAST_INIT action flag and
>>   update affected actions to not allocate percpu counters when the flag
>>   is set.
>
> I just went over all the patches and they mostly make sense to me. So
> far the only point I'm uncertain of is the naming of the flag,
> "fast_init".  That is not clear on what it does and can be overloaded
> with other stuff later and we probably don't want that.

I intentionally named it like that because I do want to overload it
with other stuff in the future, instead of adding a new flag value for
every single small optimization we might come up with :)

Also, I didn't want to hardcode implementation details into the UAPI
that we would have to maintain long after the percpu allocator in the
kernel is potentially replaced with something new and better (like idr
is being replaced with xarray now, for example).

Anyway, let's see what other people think. I'm open to changing it.

>
> Say, for example, we want percpu counters but to disable allocating
> the stats for hw, to make the counter in 28169abadb08 ("net/sched: Add
> hardware specific counters to TC actions") optional.
>
> So what about:
> TCA_ACT_FLAGS_NO_PERCPU_STATS
> TCA_ACT_FLAGS_NO_HW_STATS (this one to be done on a subsequent patchset, yes)
> ?
>
>   Marcelo
Marcelo Leitner Oct. 22, 2019, 5:09 p.m. UTC | #5
On Tue, Oct 22, 2019 at 03:52:31PM +0000, Vlad Buslov wrote:
> 
> On Tue 22 Oct 2019 at 18:15, Marcelo Ricardo Leitner <mleitner@redhat.com> wrote:
> > On Tue, Oct 22, 2019 at 05:17:51PM +0300, Vlad Buslov wrote:
> >> - Extend actions that are used for hardware offloads with optional
> >>   netlink 32bit flags field. Add TCA_ACT_FLAGS_FAST_INIT action flag and
> >>   update affected actions to not allocate percpu counters when the flag
> >>   is set.
> >
> > I just went over all the patches and they mostly make sense to me. So
> > far the only point I'm uncertain of is the naming of the flag,
> > "fast_init".  That is not clear on what it does and can be overloaded
> > with other stuff later and we probably don't want that.
> 
> I intentionally named it like that because I do want to overload it with
> other stuff in future, instead of adding new flag value for every single
> small optimization we might come up with :)

Hah :-)

> 
> Also, I didn't want to hardcode implementation details into UAPI that we
> will have to maintain for long time after percpu allocator in kernel is
> potentially replaced with something new and better (like idr is being
> replaced with xarray now, for example)

I see. OTOH, this also means that the UAPI here would be unstable
(different meanings over time for the same call), and hopefully new
behaviors would always be backwards compatible.

> 
> Anyway, lets see what other people think. I'm open to changing it.
> 
> >
> > Say, for example, we want percpu counters but to disable allocating
> > the stats for hw, to make the counter in 28169abadb08 ("net/sched: Add
> > hardware specific counters to TC actions") optional.
> >
> > So what about:
> > TCA_ACT_FLAGS_NO_PERCPU_STATS
> > TCA_ACT_FLAGS_NO_HW_STATS (this one to be done on a subsequent patchset, yes)
> > ?
> >
> >   Marcelo
>
Roman Mashak Oct. 22, 2019, 6:17 p.m. UTC | #6
Vlad Buslov <vladbu@mellanox.com> writes:

> On Tue 22 Oct 2019 at 17:35, Marcelo Ricardo Leitner <mleitner@redhat.com> wrote:
>> On Tue, Oct 22, 2019 at 05:17:51PM +0300, Vlad Buslov wrote:
>>> Currently, significant fraction of CPU time during TC filter allocation
>>> is spent in percpu allocator. Moreover, percpu allocator is protected
>>> with single global mutex which negates any potential to improve its
>>> performance by means of recent developments in TC filter update API that
>>> removed rtnl lock for some Qdiscs and classifiers. In order to
>>> significantly improve filter update rate and reduce memory usage we
>>> would like to allow users to skip percpu counters allocation for
>>> specific action if they don't expect high traffic rate hitting the
>>> action, which is a reasonable expectation for hardware-offloaded setup.
>>> In that case any potential gains to software fast-path performance
>>> gained by usage of percpu-allocated counters compared to regular integer
>>> counters protected by spinlock are not important, but amount of
>>> additional CPU and memory consumed by them is significant.
>>
>> Yes!
>>
>> I wonder how this can play together with conntrack offloading.  With
>> it the sw datapath will be more used, as a conntrack entry can only be
>> offloaded after the handshake.  That said, the host can have to
>> process quite some handshakes in sw datapath.  Seems OvS can then just
>> not set this flag in act_ct (and others for this rule), and such cases
>> will be able to leverage the percpu stats.  Right?
>
> The flag is set per each actions instance so client can chose not to use
> the flag in case-by-case basis. Conntrack use case requires further
> investigation since I'm not entirely convinced that handling first few
> packets in sw (before connection reaches established state and is
> offloaded) warrants having percpu counter.

Hi Vlad,

Did you consider using TCA_ROOT_FLAGS instead of adding another
per-action 32-bit flag?
Vlad Buslov Oct. 23, 2019, 6:38 a.m. UTC | #7
On Tue 22 Oct 2019 at 21:17, Roman Mashak <mrv@mojatatu.com> wrote:
> Vlad Buslov <vladbu@mellanox.com> writes:
>
>> On Tue 22 Oct 2019 at 17:35, Marcelo Ricardo Leitner <mleitner@redhat.com> wrote:
>>> On Tue, Oct 22, 2019 at 05:17:51PM +0300, Vlad Buslov wrote:
>>>> Currently, significant fraction of CPU time during TC filter allocation
>>>> is spent in percpu allocator. Moreover, percpu allocator is protected
>>>> with single global mutex which negates any potential to improve its
>>>> performance by means of recent developments in TC filter update API that
>>>> removed rtnl lock for some Qdiscs and classifiers. In order to
>>>> significantly improve filter update rate and reduce memory usage we
>>>> would like to allow users to skip percpu counters allocation for
>>>> specific action if they don't expect high traffic rate hitting the
>>>> action, which is a reasonable expectation for hardware-offloaded setup.
>>>> In that case any potential gains to software fast-path performance
>>>> gained by usage of percpu-allocated counters compared to regular integer
>>>> counters protected by spinlock are not important, but amount of
>>>> additional CPU and memory consumed by them is significant.
>>>
>>> Yes!
>>>
>>> I wonder how this can play together with conntrack offloading.  With
>>> it the sw datapath will be more used, as a conntrack entry can only be
>>> offloaded after the handshake.  That said, the host can have to
>>> process quite some handshakes in sw datapath.  Seems OvS can then just
>>> not set this flag in act_ct (and others for this rule), and such cases
>>> will be able to leverage the percpu stats.  Right?
>>
>> The flag is set per each actions instance so client can chose not to use
>> the flag in case-by-case basis. Conntrack use case requires further
>> investigation since I'm not entirely convinced that handling first few
>> packets in sw (before connection reaches established state and is
>> offloaded) warrants having percpu counter.
>
> Hi Vlad,
>
> Did you consider using TCA_ROOT_FLAGS instead of adding another
> per-action 32-bit flag?

Hi Roman,

I considered it, but didn't find a good way to implement my change with
TCA_ROOT_FLAGS. I needed the flags to be per-action for the following
reasons:

1. Not all actions support the flag (it is only implemented for hw
   offloaded actions).

2. TCA_ROOT_FLAGS is act API specific, and I need this to work when
   actions are created together with filters through the cls API. I
   guess I could have changed tcf_action_init_1() to require having
   TCA_ROOT_FLAGS before the actions attribute and then pass the
   obtained value to act_ops->init() as an additional argument, but
   that sounds more complex and ugly.

Regards,
Vlad
Vlad Buslov Oct. 23, 2019, 6:42 a.m. UTC | #8
On Tue 22 Oct 2019 at 20:09, Marcelo Ricardo Leitner <mleitner@redhat.com> wrote:
> On Tue, Oct 22, 2019 at 03:52:31PM +0000, Vlad Buslov wrote:
>>
>> On Tue 22 Oct 2019 at 18:15, Marcelo Ricardo Leitner <mleitner@redhat.com> wrote:
>> > On Tue, Oct 22, 2019 at 05:17:51PM +0300, Vlad Buslov wrote:
>> >> - Extend actions that are used for hardware offloads with optional
>> >>   netlink 32bit flags field. Add TCA_ACT_FLAGS_FAST_INIT action flag and
>> >>   update affected actions to not allocate percpu counters when the flag
>> >>   is set.
>> >
>> > I just went over all the patches and they mostly make sense to me. So
>> > far the only point I'm uncertain of is the naming of the flag,
>> > "fast_init".  That is not clear on what it does and can be overloaded
>> > with other stuff later and we probably don't want that.
>>
>> I intentionally named it like that because I do want to overload it with
>> other stuff in future, instead of adding new flag value for every single
>> small optimization we might come up with :)
>
> Hah :-)
>
>>
>> Also, I didn't want to hardcode implementation details into UAPI that we
>> will have to maintain for long time after percpu allocator in kernel is
>> potentially replaced with something new and better (like idr is being
>> replaced with xarray now, for example)
>
> I see. OTOH, this also means that the UAPI here would be unstable
> (different meanings over time for the same call), and hopefully new
> behaviors would always be backwards compatible.

This flag doesn't change any userland-visible behavior, it just selects
a different performance trade-off (insertion rate versus sw packet
processing speed). Counters continue to work with or without the flag.
Any optimization that actually modifies cls or act API behavior will
have to use a dedicated flag value.

>
>>
>> Anyway, lets see what other people think. I'm open to changing it.
>>
>> >
>> > Say, for example, we want percpu counters but to disable allocating
>> > the stats for hw, to make the counter in 28169abadb08 ("net/sched: Add
>> > hardware specific counters to TC actions") optional.
>> >
>> > So what about:
>> > TCA_ACT_FLAGS_NO_PERCPU_STATS
>> > TCA_ACT_FLAGS_NO_HW_STATS (this one to be done on a subsequent patchset, yes)
>> > ?
>> >
>> >   Marcelo
>>
Jamal Hadi Salim Oct. 23, 2019, 12:49 p.m. UTC | #9
Hi Vlad,

On 2019-10-22 10:17 a.m., Vlad Buslov wrote:
> Currently, significant fraction of CPU time during TC filter allocation
> is spent in percpu allocator. Moreover, percpu allocator is protected
> with single global mutex which negates any potential to improve its
> performance by means of recent developments in TC filter update API that
> removed rtnl lock for some Qdiscs and classifiers. In order to
> significantly improve filter update rate and reduce memory usage we
> would like to allow users to skip percpu counters allocation for
> specific action if they don't expect high traffic rate hitting the
> action, which is a reasonable expectation for hardware-offloaded setup.
> In that case any potential gains to software fast-path performance
> gained by usage of percpu-allocated counters compared to regular integer
> counters protected by spinlock are not important, but amount of
> additional CPU and memory consumed by them is significant.

Great to see this becoming low-hanging fruit on the tree after your
improvements.
Note: I had a discussion a few years back with Eric D. (on Cc) when I
was trying to improve action dumping; what you are seeing was very
visible when doing large batch creation of actions. At the time I was
thinking of amortizing the cost of that mutex in batch action create,
i.e. you ask the percpu allocator to alloc a batch of the stats instead
of singular allocations.

I understand your use case is different since it is for h/w offload.
If you have time, can you test with batching many actions and see the
before/after improvement?

Note: even for h/w offload it makes sense to first create the actions
and then bind them to filters (in my world that's what we end up
doing). If we can improve the first phase it is a win for both the s/w
and h/w use cases.

Question:
Given TCA_ACT_FLAGS_FAST_INIT is common to all actions, would it make
sense to use a TLV in the namespace of TCA_ACT_MAX (an outer TLV)? You
would have to pass a param to ->init().

cheers,
jamal
Jamal Hadi Salim Oct. 23, 2019, 1:02 p.m. UTC | #10
I should've read the thread backwards. My earlier email was asking a
similar question to Roman's.

On 2019-10-23 2:38 a.m., Vlad Buslov wrote:
> 
> On Tue 22 Oct 2019 at 21:17, Roman Mashak <mrv@mojatatu.com> wrote:

> Hi Roman,
> 
> I considered it, but didn't find good way to implement my change with
> TCA_ROOT_FLAGS. I needed some flags to be per-action for following
> reasons:
> 
> 1. Not all actions support the flag (only implemented for hw offloaded
>     actions).
> 
>
> 2. TCA_ROOT_FLAGS is act API specific and I need this to work when
>     actions are created with filters through the cls
>     API. I guess I could have changed tcf_action_init_1() to require
>     having TCA_ROOT_FLAGS before actions attribute and then pass obtained
>     value to act_ops->init() as additional argument, but it sounds more
>     complex and ugly.

I really should've read the thread backwards ;->
The question is whether this is uglier than introducing a new TLV for every
action.

cheers,
jamal
Vlad Buslov Oct. 23, 2019, 1:04 p.m. UTC | #11
On Wed 23 Oct 2019 at 15:49, Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> Hi Vlad,
>
> On 2019-10-22 10:17 a.m., Vlad Buslov wrote:
>> Currently, significant fraction of CPU time during TC filter allocation
>> is spent in percpu allocator. Moreover, percpu allocator is protected
>> with single global mutex which negates any potential to improve its
>> performance by means of recent developments in TC filter update API that
>> removed rtnl lock for some Qdiscs and classifiers. In order to
>> significantly improve filter update rate and reduce memory usage we
>> would like to allow users to skip percpu counters allocation for
>> specific action if they don't expect high traffic rate hitting the
>> action, which is a reasonable expectation for hardware-offloaded setup.
>> In that case any potential gains to software fast-path performance
>> gained by usage of percpu-allocated counters compared to regular integer
>> counters protected by spinlock are not important, but amount of
>> additional CPU and memory consumed by them is significant.
>
> Great to see this becoming low hanging on the fruit tree
> after your improvements.
> Note: had a discussion a few years back with Eric D.(on Cc)
> when i was trying to improve action dumping; what you are seeing
> was very visible when doing a large batch creation of actions.
> At the time i was thinking of amortizing the cost of that mutex
> in a batch action create i.e you ask the per cpu allocator
> to alloc a batch of the stats instead of singular.
>
> I understand your use case being different since it is for h/w
> offload. If you have time can you test with batching many actions
> and seeing the before/after improvement?

Will do.

>
> Note: even for h/w offload it makes sense to first create the actions
> then bind to filters (in my world thats what we end up doing).
> If we can improve the first phase it is a win for both s/w and hw use
> cases.
>
> Question:
> Given TCA_ACT_FLAGS_FAST_INIT is common to all actions, would it make
> sense to use a TLV in the namespace of TCA_ACT_MAX
> (outer TLV)? You will have to pass a param to ->init().

It is not common to all actions. I omitted modifying actions that are
not offloaded, and some actions don't use the percpu allocator at all
(pedit, for example) and have no use for this flag at the moment.

>
> cheers,
> jamal
Vlad Buslov Oct. 23, 2019, 1:08 p.m. UTC | #12
On Wed 23 Oct 2019 at 16:02, Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> I shouldve read the thread backward. My earlier email was asking
> similar question to Roman.
>
> On 2019-10-23 2:38 a.m., Vlad Buslov wrote:
>>
>> On Tue 22 Oct 2019 at 21:17, Roman Mashak <mrv@mojatatu.com> wrote:
>
>> Hi Roman,
>>
>> I considered it, but didn't find good way to implement my change with
>> TCA_ROOT_FLAGS. I needed some flags to be per-action for following
>> reasons:
>>
>> 1. Not all actions support the flag (only implemented for hw offloaded
>>     actions).
>>
>>
>> 2. TCA_ROOT_FLAGS is act API specific and I need this to work when
>>     actions are created with filters through the cls
>>     API. I guess I could have changed tcf_action_init_1() to require
>>     having TCA_ROOT_FLAGS before actions attribute and then pass obtained
>>     value to act_ops->init() as additional argument, but it sounds more
>>     complex and ugly.
>
> I really shouldve read the thread backwards;->
> The question is if this uglier than introducing a new TLV for every
> action.

I'm not introducing it for every action, only for a minority of them.

>
> cheers,
> jamal
Jamal Hadi Salim Oct. 23, 2019, 2:21 p.m. UTC | #13
On 2019-10-23 9:04 a.m., Vlad Buslov wrote:
> 
> On Wed 23 Oct 2019 at 15:49, Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>> Hi Vlad,
>>

>> I understand your use case being different since it is for h/w
>> offload. If you have time can you test with batching many actions
>> and seeing the before/after improvement?
> 
> Will do.

Thanks.

I think you may have published numbers before, but it would be interesting
to see the before and after of adding the action first and measuring the
filter improvement without caring about the allocator.

> 
>>
>> Note: even for h/w offload it makes sense to first create the actions
>> then bind to filters (in my world thats what we end up doing).
>> If we can improve the first phase it is a win for both s/w and hw use
>> cases.
>>
>> Question:
>> Given TCA_ACT_FLAGS_FAST_INIT is common to all actions, would it make
>> sense to use a TLV in the namespace of TCA_ACT_MAX
>> (outer TLV)? You will have to pass a param to ->init().
> 
> It is not common for all actions. I omitted modifying actions that are
> not offloaded and some actions don't user percpu allocator at all
> (pedit, for example) and have no use for this flag at the moment.

pedit just never got updated (it's simple to update). There is
value on the software side in having _all_ the actions use per-cpu stats.
It improves fast-path performance.

Jiri complains constantly about all these new per-action TLVs
which are generic. He promised to "fix it all" someday. Jiri, I notice
your ack here; what happened? ;->

cheers,
jamal
Vlad Buslov Oct. 23, 2019, 3:04 p.m. UTC | #14
On Wed 23 Oct 2019 at 17:21, Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> On 2019-10-23 9:04 a.m., Vlad Buslov wrote:
>>
>> On Wed 23 Oct 2019 at 15:49, Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>>> Hi Vlad,
>>>
>
>>> I understand your use case being different since it is for h/w
>>> offload. If you have time can you test with batching many actions
>>> and seeing the before/after improvement?
>>
>> Will do.
>
> Thanks.
>
> I think you may have published number before, but would be interesting
> to see the before and after of adding the action first and measuring the
> filter improvement without caring about the allocator.

For a filter with a single gact drop action (first line in the insertion rate
table in the cover letter) I get an insertion rate of 412k rules/sec with
all of the actions preallocated in advance, which is a 2x improvement.

>
>>
>>>
>>> Note: even for h/w offload it makes sense to first create the actions
>>> then bind to filters (in my world thats what we end up doing).
>>> If we can improve the first phase it is a win for both s/w and hw use
>>> cases.
>>>
>>> Question:
>>> Given TCA_ACT_FLAGS_FAST_INIT is common to all actions, would it make
>>> sense to use a TLV in the namespace of TCA_ACT_MAX
>>> (outer TLV)? You will have to pass a param to ->init().
>>
>> It is not common for all actions. I omitted modifying actions that are
>> not offloaded and some actions don't user percpu allocator at all
>> (pedit, for example) and have no use for this flag at the moment.
>
> pedit just never got updated (its simple to update). There is
> value in the software to have _all_ the actions use per cpu stats.
> It improves fast path performance.
>
> Jiri complains constantly about all these new per-action TLVs
> which are generic. He promised to "fix it all" someday. Jiri i notice
> your ack here, what happened? ;->
>
> cheers,
> jamal
Jiri Pirko Oct. 24, 2019, 7:35 a.m. UTC | #15
Wed, Oct 23, 2019 at 04:21:51PM CEST, jhs@mojatatu.com wrote:
>On 2019-10-23 9:04 a.m., Vlad Buslov wrote:
>> 
>> On Wed 23 Oct 2019 at 15:49, Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>> > Hi Vlad,
>> > 
>
>> > I understand your use case being different since it is for h/w
>> > offload. If you have time can you test with batching many actions
>> > and seeing the before/after improvement?
>> 
>> Will do.
>
>Thanks.
>
>I think you may have published number before, but would be interesting
>to see the before and after of adding the action first and measuring the
>filter improvement without caring about the allocator.
>
>> 
>> > 
>> > Note: even for h/w offload it makes sense to first create the actions
>> > then bind to filters (in my world thats what we end up doing).
>> > If we can improve the first phase it is a win for both s/w and hw use
>> > cases.
>> > 
>> > Question:
>> > Given TCA_ACT_FLAGS_FAST_INIT is common to all actions, would it make
>> > sense to use a TLV in the namespace of TCA_ACT_MAX
>> > (outer TLV)? You will have to pass a param to ->init().
>> 
>> It is not common for all actions. I omitted modifying actions that are
>> not offloaded and some actions don't user percpu allocator at all
>> (pedit, for example) and have no use for this flag at the moment.
>
>pedit just never got updated (its simple to update). There is
>value in the software to have _all_ the actions use per cpu stats.
>It improves fast path performance.
>
>Jiri complains constantly about all these new per-action TLVs
>which are generic. He promised to "fix it all" someday. Jiri i notice
>your ack here, what happened? ;->

Correct, it would be great. However, I'm not sure how exactly to do that now.
Do you have some ideas?

But basically this patchset does what was done many, many times in the
past. I think it was a mistake in the original design not to have some
"common attrs" :/ Lesson learned for the next interfaces.
Roopa Prabhu Oct. 24, 2019, 3:12 p.m. UTC | #16
On Tue, Oct 22, 2019 at 10:10 AM Marcelo Ricardo Leitner
<mleitner@redhat.com> wrote:
>
> On Tue, Oct 22, 2019 at 03:52:31PM +0000, Vlad Buslov wrote:
> >
> > On Tue 22 Oct 2019 at 18:15, Marcelo Ricardo Leitner <mleitner@redhat.com> wrote:
> > > On Tue, Oct 22, 2019 at 05:17:51PM +0300, Vlad Buslov wrote:
> > >> - Extend actions that are used for hardware offloads with optional
> > >>   netlink 32bit flags field. Add TCA_ACT_FLAGS_FAST_INIT action flag and
> > >>   update affected actions to not allocate percpu counters when the flag
> > >>   is set.
> > >
> > > I just went over all the patches and they mostly make sense to me. So
> > > far the only point I'm uncertain of is the naming of the flag,
> > > "fast_init".  That is not clear on what it does and can be overloaded
> > > with other stuff later and we probably don't want that.
> >
> > I intentionally named it like that because I do want to overload it with
> > other stuff in future, instead of adding new flag value for every single
> > small optimization we might come up with :)
>
> Hah :-)
>
> >
> > Also, I didn't want to hardcode implementation details into UAPI that we
> > will have to maintain for long time after percpu allocator in kernel is
> > potentially replaced with something new and better (like idr is being
> > replaced with xarray now, for example)
>
> I see. OTOH, this also means that the UAPI here would be unstable
> (different meanings over time for the same call), and hopefully new
> behaviors would always be backwards compatible.

+1, I also think optimization flags should be specific to what they optimize.
TCA_ACT_FLAGS_NO_PERCPU_STATS seems like a better choice. It allows
the user to explicitly request it.
Vlad Buslov Oct. 24, 2019, 3:18 p.m. UTC | #17
On Thu 24 Oct 2019 at 18:12, Roopa Prabhu <roopa@cumulusnetworks.com> wrote:
> On Tue, Oct 22, 2019 at 10:10 AM Marcelo Ricardo Leitner
> <mleitner@redhat.com> wrote:
>>
>> On Tue, Oct 22, 2019 at 03:52:31PM +0000, Vlad Buslov wrote:
>> >
>> > On Tue 22 Oct 2019 at 18:15, Marcelo Ricardo Leitner <mleitner@redhat.com> wrote:
>> > > On Tue, Oct 22, 2019 at 05:17:51PM +0300, Vlad Buslov wrote:
>> > >> - Extend actions that are used for hardware offloads with optional
>> > >>   netlink 32bit flags field. Add TCA_ACT_FLAGS_FAST_INIT action flag and
>> > >>   update affected actions to not allocate percpu counters when the flag
>> > >>   is set.
>> > >
>> > > I just went over all the patches and they mostly make sense to me. So
>> > > far the only point I'm uncertain of is the naming of the flag,
>> > > "fast_init".  That is not clear on what it does and can be overloaded
>> > > with other stuff later and we probably don't want that.
>> >
>> > I intentionally named it like that because I do want to overload it with
>> > other stuff in future, instead of adding new flag value for every single
>> > small optimization we might come up with :)
>>
>> Hah :-)
>>
>> >
>> > Also, I didn't want to hardcode implementation details into UAPI that we
>> > will have to maintain for long time after percpu allocator in kernel is
>> > potentially replaced with something new and better (like idr is being
>> > replaced with xarray now, for example)
>>
>> I see. OTOH, this also means that the UAPI here would be unstable
>> (different meanings over time for the same call), and hopefully new
>> behaviors would always be backwards compatible.
>
> +1, I also think optimization flags should be specific to what they optimize.
> TCA_ACT_FLAGS_NO_PERCPU_STATS seems like a better choice. It allows
> user to explicitly request for it.

Why would the user care about details of optimizations that don't produce
visible side effects for userland? Again, the counters continue to work the
same with or without the flag.
Vlad Buslov Oct. 24, 2019, 3:23 p.m. UTC | #18
On Thu 24 Oct 2019 at 10:35, Jiri Pirko <jiri@resnulli.us> wrote:
> Wed, Oct 23, 2019 at 04:21:51PM CEST, jhs@mojatatu.com wrote:
>>On 2019-10-23 9:04 a.m., Vlad Buslov wrote:
>>> 
>>> On Wed 23 Oct 2019 at 15:49, Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>>> > Hi Vlad,
>>> > 
>>
>>> > I understand your use case being different since it is for h/w
>>> > offload. If you have time can you test with batching many actions
>>> > and seeing the before/after improvement?
>>> 
>>> Will do.
>>
>>Thanks.
>>
>>I think you may have published number before, but would be interesting
>>to see the before and after of adding the action first and measuring the
>>filter improvement without caring about the allocator.
>>
>>> 
>>> > 
>>> > Note: even for h/w offload it makes sense to first create the actions
>>> > then bind to filters (in my world thats what we end up doing).
>>> > If we can improve the first phase it is a win for both s/w and hw use
>>> > cases.
>>> > 
>>> > Question:
>>> > Given TCA_ACT_FLAGS_FAST_INIT is common to all actions, would it make
>>> > sense to use a TLV in the namespace of TCA_ACT_MAX
>>> > (outer TLV)? You will have to pass a param to ->init().
>>> 
>>> It is not common for all actions. I omitted modifying actions that are
>>> not offloaded and some actions don't user percpu allocator at all
>>> (pedit, for example) and have no use for this flag at the moment.
>>
>>pedit just never got updated (its simple to update). There is
>>value in the software to have _all_ the actions use per cpu stats.
>>It improves fast path performance.
>>
>>Jiri complains constantly about all these new per-action TLVs
>>which are generic. He promised to "fix it all" someday. Jiri i notice
>>your ack here, what happened? ;->
>
> Correct, it would be great. However not sure how exactly to do that now.
> Do you have some ideas.
>
> But basically this patchset does what was done many many times in the
> past. I think it was a mistake in the original design not to have some
> "common attrs" :/ Lesson learned for next interfaces.

Jamal, can we reach some conclusion? Do you still suggest refactoring
the patches to use TCA_ROOT_FLAGS and parsing it in the act API instead of
in individual actions?

Thanks,
Vlad
Jamal Hadi Salim Oct. 24, 2019, 4:12 p.m. UTC | #19
On 2019-10-24 11:23 a.m., Vlad Buslov wrote:
> On Thu 24 Oct 2019 at 10:35, Jiri Pirko <jiri@resnulli.us> wrote:
>> Wed, Oct 23, 2019 at 04:21:51PM CEST, jhs@mojatatu.com wrote:
>>> On 2019-10-23 9:04 a.m., Vlad Buslov wrote:
>>>>

[..]
>>> Jiri complains constantly about all these new per-action TLVs
>>> which are generic. He promised to "fix it all" someday. Jiri i notice
>>> your ack here, what happened? ;->
>>
>> Correct, it would be great. However not sure how exactly to do that now.
>> Do you have some ideas.
>>
>>
>> But basically this patchset does what was done many many times in the
>> past. I think it was a mistake in the original design not to have some
>> "common attrs" :/ Lesson learned for next interfaces.
> 

Jiri, we have a high-level TLV space which can be applied to all
actions. See the discussion below with Vlad. At least for this specific
change we can avoid repeating that mistake.

> Jamal, can we reach some conclusion? Do you still suggest to refactor
> the patches to use TCA_ROOT_FLAGS and parse it in act API instead of
> individual actions?
> 

IMO this would certainly help us walk away from having
every action replicate the same attribute with common semantics.
Refactoring ->init() to take an extra attribute may look ugly but
is cleaner from a uapi pov. Actions that don't implement the feature
can ignore the extra parameter(s). It doesn't have to be TCA_ROOT_FLAGS,
but certainly that high-level TLV space is the right place to put it.
WDYT?

cheers,
jamal
Vlad Buslov Oct. 24, 2019, 4:44 p.m. UTC | #20
On Thu 24 Oct 2019 at 19:12, Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> On 2019-10-24 11:23 a.m., Vlad Buslov wrote:
>> On Thu 24 Oct 2019 at 10:35, Jiri Pirko <jiri@resnulli.us> wrote:
>>> Wed, Oct 23, 2019 at 04:21:51PM CEST, jhs@mojatatu.com wrote:
>>>> On 2019-10-23 9:04 a.m., Vlad Buslov wrote:
>>>>>
>
> [..]
>>>> Jiri complains constantly about all these new per-action TLVs
>>>> which are generic. He promised to "fix it all" someday. Jiri i notice
>>>> your ack here, what happened? ;->
>>>
>>> Correct, it would be great. However not sure how exactly to do that now.
>>> Do you have some ideas.
>>>
>>>
>>> But basically this patchset does what was done many many times in the
>>> past. I think it was a mistake in the original design not to have some
>>> "common attrs" :/ Lesson learned for next interfaces.
>>
>
> Jiri, we have a high level TLV space which can be applied to all
> actions. See discussion below with Vlad. At least for this specific
> change we can get away from repeating that mistake.
>
>> Jamal, can we reach some conclusion? Do you still suggest to refactor
>> the patches to use TCA_ROOT_FLAGS and parse it in act API instead of
>> individual actions?
>>
>
> IMO this would certainly help us walk away from having
> every action replicate the same attribute with common semantics.
> refactoring ->init() to take an extra attribute may look ugly but
> is cleaner from a uapi pov. Actions that dont implement the feature
> can ignore the extra parameter(s). It doesnt have to be TCA_ROOT_FLAGS
> but certainly that high level TLV space is the right place to put it.
> WDYT?
>
> cheers,
> jamal

Well, I like having it per-action better for the reasons I explained
before (some actions don't use the percpu allocator at all, and some actions
that are not hw-offloaded don't need it), but I think both solutions
have their benefits and drawbacks, so I'm fine with refactoring it.

Do you have any opinion regarding flag naming? Several people suggested
being more specific, but I strongly dislike the idea of hardcoding the
name of an internal kernel data structure in a UAPI constant that will
potentially outlive the data structure by a long time.
Jamal Hadi Salim Oct. 24, 2019, 5:17 p.m. UTC | #21
Hi Vlad,

On 2019-10-24 12:44 p.m., Vlad Buslov wrote:

> 
> Well, I like having it per-action better because of reasons I explained
> before (some actions don't use percpu allocator at all and some actions
> that are not hw offloaded don't need it), but I think both solutions
> have their benefits and drawbacks, so I'm fine with refactoring it.
>

I am happy you are doing all this great work already. I would be happier 
if you did it at the root level. It is something that we have been
meaning to deal with for a while now.

> Do you have any opinion regarding flag naming? Several people suggested
> to be more specific, but I strongly dislike the idea of hardcoding the
> name of a internal kernel data structure in UAPI constant that will
> potentially outlive the data structure by a long time.

Could you not just name the bit with a define to say what the bit
is for and still use the top-level flag? For example, we have
a bit called "TCA_FLAG_LARGE_DUMP_ON".

cheers,
jamal
Marcelo Leitner Oct. 24, 2019, 5:31 p.m. UTC | #22
On Thu, Oct 24, 2019 at 03:18:00PM +0000, Vlad Buslov wrote:
> 
> On Thu 24 Oct 2019 at 18:12, Roopa Prabhu <roopa@cumulusnetworks.com> wrote:
> > On Tue, Oct 22, 2019 at 10:10 AM Marcelo Ricardo Leitner
> > <mleitner@redhat.com> wrote:
> >>
> >> On Tue, Oct 22, 2019 at 03:52:31PM +0000, Vlad Buslov wrote:
> >> >
> >> > On Tue 22 Oct 2019 at 18:15, Marcelo Ricardo Leitner <mleitner@redhat.com> wrote:
> >> > > On Tue, Oct 22, 2019 at 05:17:51PM +0300, Vlad Buslov wrote:
> >> > >> - Extend actions that are used for hardware offloads with optional
> >> > >>   netlink 32bit flags field. Add TCA_ACT_FLAGS_FAST_INIT action flag and
> >> > >>   update affected actions to not allocate percpu counters when the flag
> >> > >>   is set.
> >> > >
> >> > > I just went over all the patches and they mostly make sense to me. So
> >> > > far the only point I'm uncertain of is the naming of the flag,
> >> > > "fast_init".  That is not clear on what it does and can be overloaded
> >> > > with other stuff later and we probably don't want that.
> >> >
> >> > I intentionally named it like that because I do want to overload it with
> >> > other stuff in future, instead of adding new flag value for every single
> >> > small optimization we might come up with :)
> >>
> >> Hah :-)
> >>
> >> >
> >> > Also, I didn't want to hardcode implementation details into UAPI that we
> >> > will have to maintain for long time after percpu allocator in kernel is
> >> > potentially replaced with something new and better (like idr is being
> >> > replaced with xarray now, for example)
> >>
> >> I see. OTOH, this also means that the UAPI here would be unstable
> >> (different meanings over time for the same call), and hopefully new
> >> behaviors would always be backwards compatible.
> >
> > +1, I also think optimization flags should be specific to what they optimize.
> > TCA_ACT_FLAGS_NO_PERCPU_STATS seems like a better choice. It allows
> > user to explicitly request for it.
> 
> Why would user care about details of optimizations that doesn't produce
> visible side effects for user land? Again, counters continue to work the
> same with or without the flag.

It's not just an optimization detail, like whether to use likely() or
not; it does produce user-visible effects. Not in terms of API but
of system behavior. Otherwise we wouldn't need the flag, right?
_FAST_INIT, or the fact that it inits faster, is just one of the
aspects of it; one could also want it for being lighter in
footprint, as currently it is really easy to eat gigs of RAM away on
these percpu counters. So how should it be called, _FAST_INIT or
_SLIM_RULES?

It may be an implementation detail, yes, but we shouldn't be building
usage profiles; instead, let the user pick what they want. Likewise,
we don't have net.ipv4.fast_tcp, but net.ipv4.tcp_sack, tcp_timestamps
& co.

If we can find another name then, without using 'percpu' in it but
without establishing profiles, that would be nice.
Like TCA_ACT_FLAGS_SIMPLE_STATS, or so.

Even so, I still prefer PERCPU, as it's as explicit as it can be. Note
that bpf already does it like that:
uapi]$ grep BPF.*HASH -- linux/bpf.h
        BPF_MAP_TYPE_HASH,
        BPF_MAP_TYPE_PERCPU_HASH,
        BPF_MAP_TYPE_LRU_HASH,
        BPF_MAP_TYPE_LRU_PERCPU_HASH,
...
Vlad Buslov Oct. 24, 2019, 6:03 p.m. UTC | #23
On Thu 24 Oct 2019 at 20:31, Marcelo Ricardo Leitner <mleitner@redhat.com> wrote:
> On Thu, Oct 24, 2019 at 03:18:00PM +0000, Vlad Buslov wrote:
>>
>> On Thu 24 Oct 2019 at 18:12, Roopa Prabhu <roopa@cumulusnetworks.com> wrote:
>> > On Tue, Oct 22, 2019 at 10:10 AM Marcelo Ricardo Leitner
>> > <mleitner@redhat.com> wrote:
>> >>
>> >> On Tue, Oct 22, 2019 at 03:52:31PM +0000, Vlad Buslov wrote:
>> >> >
>> >> > On Tue 22 Oct 2019 at 18:15, Marcelo Ricardo Leitner <mleitner@redhat.com> wrote:
>> >> > > On Tue, Oct 22, 2019 at 05:17:51PM +0300, Vlad Buslov wrote:
>> >> > >> - Extend actions that are used for hardware offloads with optional
>> >> > >>   netlink 32bit flags field. Add TCA_ACT_FLAGS_FAST_INIT action flag and
>> >> > >>   update affected actions to not allocate percpu counters when the flag
>> >> > >>   is set.
>> >> > >
>> >> > > I just went over all the patches and they mostly make sense to me. So
>> >> > > far the only point I'm uncertain of is the naming of the flag,
>> >> > > "fast_init".  That is not clear on what it does and can be overloaded
>> >> > > with other stuff later and we probably don't want that.
>> >> >
>> >> > I intentionally named it like that because I do want to overload it with
>> >> > other stuff in future, instead of adding new flag value for every single
>> >> > small optimization we might come up with :)
>> >>
>> >> Hah :-)
>> >>
>> >> >
>> >> > Also, I didn't want to hardcode implementation details into UAPI that we
>> >> > will have to maintain for long time after percpu allocator in kernel is
>> >> > potentially replaced with something new and better (like idr is being
>> >> > replaced with xarray now, for example)
>> >>
>> >> I see. OTOH, this also means that the UAPI here would be unstable
>> >> (different meanings over time for the same call), and hopefully new
>> >> behaviors would always be backwards compatible.
>> >
>> > +1, I also think optimization flags should be specific to what they optimize.
>> > TCA_ACT_FLAGS_NO_PERCPU_STATS seems like a better choice. It allows
>> > user to explicitly request for it.
>>
>> Why would user care about details of optimizations that doesn't produce
>> visible side effects for user land? Again, counters continue to work the
>> same with or without the flag.
>
> It's just just details of optimizations, on whether to use likely() or
> not, and it does produce user visible effects. Not in terms of API but
> of system behavior. Otherwise we wouldn't need the flag, right?
> _FAST_INIT, or the fact that it inits faster, is just one of the
> aspects of it, but one could also want it for being lighther in
> footprint as currently it is really easy to eat Gigs of RAM away on
> these percpu counters. So how should it be called, _FAST_INIT or
> _SLIM_RULES?

Hmm, yes, I think I emphasized insertion rate too much because it was my
main goal, but memory usage is also important.

>
> It may be implementation detail, yes, but we shouldn't be building
> usage profiles and instead let the user pick what they want. Likewise,
> we don't have net.ipv4.fast_tcp, but net.ipv4.tcp_sack, tcp_timestamps
> & cia.
>
> If we can find another name then, without using 'percpu' on it but
> without stablishing profiles, that would be nice.
> Like TCA_ACT_FLAGS_SIMPLE_STATS, or so.
>
> Even though I still prefer the PERCPU, as it's as explicit as it can be. Note
> that bpf does it like that already:
> uapi]$ grep BPF.*HASH -- linux/bpf.h
>         BPF_MAP_TYPE_HASH,
>         BPF_MAP_TYPE_PERCPU_HASH,
>         BPF_MAP_TYPE_LRU_HASH,
>         BPF_MAP_TYPE_LRU_PERCPU_HASH,
> ...

I would argue that all of these produce visible side effects for the
user. Selective acks are clearly visible in packet dumps, BPF maps are
directly accessed from BPF programs loaded from user space, etc. But I
get your point that the naming can be improved over FAST_INIT, which is too
abstract.
Vlad Buslov Oct. 24, 2019, 6:05 p.m. UTC | #24
On Thu 24 Oct 2019 at 20:17, Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> Hi Vlad,
>
> On 2019-10-24 12:44 p.m., Vlad Buslov wrote:
>
>>
>> Well, I like having it per-action better because of reasons I explained
>> before (some actions don't use percpu allocator at all and some actions
>> that are not hw offloaded don't need it), but I think both solutions
>> have their benefits and drawbacks, so I'm fine with refactoring it.
>>
>
> I am happy you are doing all this great work already. I would be happier if you
> did it at the root level. It is something that we have been
> meaning to deal with for a while now.
>
>> Do you have any opinion regarding flag naming? Several people suggested
>> to be more specific, but I strongly dislike the idea of hardcoding the
>> name of a internal kernel data structure in UAPI constant that will
>> potentially outlive the data structure by a long time.
>
> Could you not just name the bit with a define to say what the bit
> is for and still use the top level flag? Example we have
> a bit called "TCA_FLAG_LARGE_DUMP_ON"

Yes, of course. I was talking strictly about the naming of the
TCA_ACT_FLAGS_FAST_INIT flag value constant.

>
> cheers,
> jamal
Jamal Hadi Salim Oct. 25, 2019, 11:46 a.m. UTC | #25
On 2019-10-24 2:05 p.m., Vlad Buslov wrote:
> 

> 
> Yes, of course. I was talking strictly about naming of
> TCA_ACT_FLAGS_FAST_INIT flag value constant.
> 

Sorry, I missed that. You know, discussions on names and
punctuation can last a lifetime ;-> [1].
A name which reflects semantics or established style
is always helpful, so Marcelo's suggestion is reasonable.
In the same spirit as TCA_FLAG_LARGE_DUMP_ON, how about
TCA_FLAG_NO_PERCPU_STATS?

cheers,
jamal
[1]http://bikeshed.com/
Vlad Buslov Oct. 25, 2019, 11:55 a.m. UTC | #26
On Fri 25 Oct 2019 at 14:46, Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> On 2019-10-24 2:05 p.m., Vlad Buslov wrote:
>>
>
>>
>> Yes, of course. I was talking strictly about naming of
>> TCA_ACT_FLAGS_FAST_INIT flag value constant.
>>
>
> Sorry, I missed that. You know discussions on names and
> punctiations can last a lifetime;->[1].
> A name which reflects semantics or established style
> is always helpful. So Marcelo's suggestion is reasonable..
> In the same spirit as TCA_FLAG_LARGE_DUMP_ON, how about:
> TCA_FLAG_NO_PERCPU_STATS?
>
> cheers,
> jamal
> [1]http://bikeshed.com/

It looks like I'm the only one who doesn't like hardcoding internal
kernel data structure names into the UAPI. Will change it in V2.

Thanks,
Vlad
Vlad Buslov Oct. 25, 2019, 2:26 p.m. UTC | #27
On Thu 24 Oct 2019 at 19:12, Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> On 2019-10-24 11:23 a.m., Vlad Buslov wrote:
>> On Thu 24 Oct 2019 at 10:35, Jiri Pirko <jiri@resnulli.us> wrote:
>>> Wed, Oct 23, 2019 at 04:21:51PM CEST, jhs@mojatatu.com wrote:
>>>> On 2019-10-23 9:04 a.m., Vlad Buslov wrote:
>>>>>
>
> [..]
>>>> Jiri complains constantly about all these new per-action TLVs
>>>> which are generic. He promised to "fix it all" someday. Jiri i notice
>>>> your ack here, what happened? ;->
>>>
>>> Correct, it would be great. However not sure how exactly to do that now.
>>> Do you have some ideas.
>>>
>>>
>>> But basically this patchset does what was done many many times in the
>>> past. I think it was a mistake in the original design not to have some
>>> "common attrs" :/ Lesson learned for next interfaces.
>>
>
> Jiri, we have a high level TLV space which can be applied to all
> actions. See discussion below with Vlad. At least for this specific
> change we can get away from repeating that mistake.
>
>> Jamal, can we reach some conclusion? Do you still suggest to refactor
>> the patches to use TCA_ROOT_FLAGS and parse it in act API instead of
>> individual actions?
>>
>
> IMO this would certainly help us walk away from having
> every action replicate the same attribute with common semantics.
> Refactoring ->init() to take an extra attribute may look ugly but
> is cleaner from a uapi pov. Actions that don't implement the feature
> can ignore the extra parameter(s). It doesn't have to be TCA_ROOT_FLAGS
> but certainly that high-level TLV space is the right place to put it.
> WDYT?
>
> cheers,
> jamal

Hi Jamal,

I've been trying to re-design this change to use a high-level TLV, but I
don't understand how to make it backward compatible. If I just put the
new attribute before the nested action attributes, parsing will fail
when an older client sends a netlink packet without it, and I don't see
a straightforward way to distinguish the new attribute from the
following nested action attribute (which might accidentally have the
same length and fall into the allowed attribute range). I don't have
much experience working with netlink, so I might be missing something
obvious here. However, extending the existing action attributes (which
are already conveniently parsed in the tcf_action_init_1() function)
with a new optional 'flags' value seems like a straightforward and
backward-compatible solution.

Rough outline in code:

2 files changed, 6 insertions(+), 2 deletions(-)
include/uapi/linux/pkt_cls.h | 1 +
net/sched/act_api.c          | 7 +++++--

modified   include/uapi/linux/pkt_cls.h
@@ -16,6 +16,7 @@ enum {
 	TCA_ACT_STATS,
 	TCA_ACT_PAD,
 	TCA_ACT_COOKIE,
+	TCA_ACT_FLAGS,
 	__TCA_ACT_MAX
 };
 
modified   net/sched/act_api.c
@@ -852,6 +852,7 @@ struct tc_action *tcf_action_init_1(struct net *net, struct tcf_proto *tp,
 	char act_name[IFNAMSIZ];
 	struct nlattr *tb[TCA_ACT_MAX + 1];
 	struct nlattr *kind;
+	u32 flags = 0;
 	int err;
 
 	if (name == NULL) {
@@ -877,6 +878,8 @@ struct tc_action *tcf_action_init_1(struct net *net, struct tcf_proto *tp,
 				goto err_out;
 			}
 		}
+		if (tb[TCA_ACT_FLAGS])
+			flags = nla_get_bitfield32(tb[TCA_ACT_FLAGS]).value;
 	} else {
 		if (strlcpy(act_name, name, IFNAMSIZ) >= IFNAMSIZ) {
 			NL_SET_ERR_MSG(extack, "TC action name too long");
@@ -915,10 +918,10 @@ struct tc_action *tcf_action_init_1(struct net *net, struct tcf_proto *tp,
 	/* backward compatibility for policer */
 	if (name == NULL)
 		err = a_o->init(net, tb[TCA_ACT_OPTIONS], est, &a, ovr, bind,
-				rtnl_held, tp, extack);
+				rtnl_held, tp, flags, extack);
 	else
 		err = a_o->init(net, nla, est, &a, ovr, bind, rtnl_held,
-				tp, extack);
+				tp, flags, extack);
 	if (err < 0)
 		goto err_mod;


WDYT?

Regards,
Vlad
Jamal Hadi Salim Oct. 25, 2019, 2:57 p.m. UTC | #28
On 2019-10-25 10:26 a.m., Vlad Buslov wrote:

> I've been trying to re-design this change to use a high-level TLV, but I
> don't understand how to make it backward compatible. If I just put the
> new attribute before the nested action attributes, parsing will fail
> when an older client sends a netlink packet without it, and I don't see
> a straightforward way to distinguish the new attribute from the
> following nested action attribute (which might accidentally have the
> same length and fall into the allowed attribute range). I don't have
> much experience working with netlink, so I might be missing something
> obvious here. However, extending the existing action attributes (which
> are already conveniently parsed in the tcf_action_init_1() function)
> with a new optional 'flags' value seems like a straightforward and
> backward-compatible solution.
>

I think I understand what you are saying. Let me take a quick
look at the code and get back to you.

cheers,
jamal
Jamal Hadi Salim Oct. 25, 2019, 3:08 p.m. UTC | #29
On 2019-10-25 10:57 a.m., Jamal Hadi Salim wrote:
> On 2019-10-25 10:26 a.m., Vlad Buslov wrote:

> 
> I think i understand what you are saying. Let me take a quick
> look at the code and get back to you.
>

How about something like this (not even compile-tested)..

Yes, that init is getting uglier... over time, if we are going to move
things into shared attributes, we will need to introduce a new data
struct to pass to ->init().

cheers,
jamal
diff --git a/net/sched/act_api.c b/net/sched/act_api.c
index 69d4676a402f..b0d1e00fe2c2 100644
--- a/net/sched/act_api.c
+++ b/net/sched/act_api.c
@@ -842,6 +842,7 @@ static const struct nla_policy tcf_action_policy[TCA_ACT_MAX + 1] = {
 struct tc_action *tcf_action_init_1(struct net *net, struct tcf_proto *tp,
 				    struct nlattr *nla, struct nlattr *est,
 				    char *name, int ovr, int bind,
+				    struct nla_bitfield32 *root_flags,
 				    bool rtnl_held,
 				    struct netlink_ext_ack *extack)
 {
@@ -914,10 +915,10 @@ struct tc_action *tcf_action_init_1(struct net *net, struct tcf_proto *tp,
 	/* backward compatibility for policer */
 	if (name == NULL)
 		err = a_o->init(net, tb[TCA_ACT_OPTIONS], est, &a, ovr, bind,
-				rtnl_held, tp, extack);
+				rtnl_held, root_flags, tp, extack);
 	else
 		err = a_o->init(net, nla, est, &a, ovr, bind, rtnl_held,
-				tp, extack);
+				root_flags, tp, extack);
 	if (err < 0)
 		goto err_mod;
 
@@ -955,7 +956,8 @@ struct tc_action *tcf_action_init_1(struct net *net, struct tcf_proto *tp,
 int tcf_action_init(struct net *net, struct tcf_proto *tp, struct nlattr *nla,
 		    struct nlattr *est, char *name, int ovr, int bind,
 		    struct tc_action *actions[], size_t *attr_size,
-		    bool rtnl_held, struct netlink_ext_ack *extack)
+		    bool rtnl_held, struct nla_bitfield32 *root_flags,
+		    struct netlink_ext_ack *extack)
 {
 	struct nlattr *tb[TCA_ACT_MAX_PRIO + 1];
 	struct tc_action *act;
@@ -970,7 +972,7 @@ int tcf_action_init(struct net *net, struct tcf_proto *tp, struct nlattr *nla,
 
 	for (i = 1; i <= TCA_ACT_MAX_PRIO && tb[i]; i++) {
 		act = tcf_action_init_1(net, tp, tb[i], est, name, ovr, bind,
-					rtnl_held, extack);
+					root_flags, rtnl_held, extack);
 		if (IS_ERR(act)) {
 			err = PTR_ERR(act);
 			goto err;
@@ -1350,6 +1352,7 @@ tcf_add_notify(struct net *net, struct nlmsghdr *n, struct tc_action *actions[],
 
 static int tcf_action_add(struct net *net, struct nlattr *nla,
 			  struct nlmsghdr *n, u32 portid, int ovr,
+			  struct nla_bitfield32 *root_flags,
 			  struct netlink_ext_ack *extack)
 {
 	size_t attr_size = 0;
@@ -1358,7 +1361,8 @@ static int tcf_action_add(struct net *net, struct nlattr *nla,
 
 	for (loop = 0; loop < 10; loop++) {
 		ret = tcf_action_init(net, NULL, nla, NULL, NULL, ovr, 0,
-				      actions, &attr_size, true, extack);
+				      actions, &attr_size, true, root_flags,
+				      extack);
 		if (ret != -EAGAIN)
 			break;
 	}
@@ -1385,6 +1389,7 @@ static int tc_ctl_action(struct sk_buff *skb, struct nlmsghdr *n,
 	struct net *net = sock_net(skb->sk);
 	struct nlattr *tca[TCA_ROOT_MAX + 1];
 	u32 portid = skb ? NETLINK_CB(skb).portid : 0;
+	struct nla_bitfield32 root_flags = {0, 0};
 	int ret = 0, ovr = 0;
 
 	if ((n->nlmsg_type != RTM_GETACTION) &&
@@ -1412,8 +1417,11 @@ static int tc_ctl_action(struct sk_buff *skb, struct nlmsghdr *n,
 		 */
 		if (n->nlmsg_flags & NLM_F_REPLACE)
 			ovr = 1;
+		if (tb[TCA_ROOT_FLAGS])
+			root_flags = nla_get_bitfield32(tb[TCA_ROOT_FLAGS]);
+
 		ret = tcf_action_add(net, tca[TCA_ACT_TAB], n, portid, ovr,
-				     extack);
+				     &root_flags, extack);
 		break;
 	case RTM_DELACTION:
 		ret = tca_action_gd(net, tca[TCA_ACT_TAB], n,
Vlad Buslov Oct. 25, 2019, 3:18 p.m. UTC | #30
On Fri 25 Oct 2019 at 18:08, Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> On 2019-10-25 10:57 a.m., Jamal Hadi Salim wrote:
>> On 2019-10-25 10:26 a.m., Vlad Buslov wrote:
>
>>
>> I think i understand what you are saying. Let me take a quick
>> look at the code and get back to you.
>>
>
> How about something like this (not even compile tested)..
>
> Yes, that init is getting uglier... over time if we
> are going to move things into shared attributes we will
> need to introduce a new data struct to pass to ->init()
>
> cheers,
> jamal

The problem with this approach is that it only works when actions are
created through the act API, and not when they are created together with
a filter by the cls API, which doesn't expect or parse TCA_ROOT. That is
why I wanted to have something in tcf_action_init_1(), which is called
by both of them.
Jamal Hadi Salim Oct. 25, 2019, 3:43 p.m. UTC | #31
On 2019-10-25 11:18 a.m., Vlad Buslov wrote:
> 

> The problem with this approach is that it only works when actions are
> created through act API, and not when they are created together with
> filter by cls API which doesn't expect or parse TCA_ROOT. That is why I
> wanted to have something in tcf_action_init_1() which is called by both
> of them.
> 

Aha. So the call path for tcf_action_init_1() via cls_api also needs
to have this infra. I think I understand better what you wanted
to do earlier with changing those enums.

This is fixable - we just need to have the classifier side also take a
new action root flags bitfield (I can send a sample). To your original
comment, it is ugly. But maybe "fixing it" is pushing the boundaries
and we should just go on and let your original approach in.
My only worry is, given this is uapi, if we go back to your original
idea we continue to propagate the bad design and we can't take it back
(all the tooling etc. would be cast in stone for the next 5 years even
if we did fix it in a month). Thoughts?

cheers,
jamal
Jamal Hadi Salim Oct. 25, 2019, 4:06 p.m. UTC | #32
On 2019-10-25 11:43 a.m., Jamal Hadi Salim wrote:
> On 2019-10-25 11:18 a.m., Vlad Buslov wrote:
>>
> 
>> The problem with this approach is that it only works when actions are
>> created through act API, and not when they are created together with
>> filter by cls API which doesn't expect or parse TCA_ROOT. That is why I
>> wanted to have something in tcf_action_init_1() which is called by both
>> of them.
>>
> 
> Aha. So the call path for tcf_action_init_1() via cls_api also needs
> to have this infra. I think i understand better what you wanted
> to do earlier with changing those enums.
> 

Hold on. Looking more at the code, the direct call to tcf_action_init_1()
from the cls code path is for backward compat with the old policer
approach. I think even modern iproute2 doesn't support that kind of call
anymore. So you can pass NULL there for the *flags.

But: for direct call to tcf_action_init() we would have
to extract the flag from the TLV.
The TLV already has TCA_ROOT_FLAGS in it.


cheers,
jamal
Vlad Buslov Oct. 25, 2019, 4:08 p.m. UTC | #33
On Fri 25 Oct 2019 at 18:43, Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> On 2019-10-25 11:18 a.m., Vlad Buslov wrote:
>>
>
>> The problem with this approach is that it only works when actions are
>> created through act API, and not when they are created together with
>> filter by cls API which doesn't expect or parse TCA_ROOT. That is why I
>> wanted to have something in tcf_action_init_1() which is called by both
>> of them.
>>
>
> Aha. So the call path for tcf_action_init_1() via cls_api also needs
> to have this infra. I think i understand better what you wanted
> to do earlier with changing those enums.
>
> This is fixable - we just need to have the classifier side also take a
> new action root flags bitfield (I can send a sample). To your original

Trying to add this infra to cls API leads back to my original question:
how do I do it in backward compatible manner? I assume that we can't
break users of RTM_NEWTFILTER.

> comment, it is ugly. But maybe "fixing it" is pushing the boundaries
> and we should just go on and let your original approach in.
> My only worry is, given this is uapi, if we go back to your original
> idea we continue to propagate the bad design and we cant take it back
> (all the tooling etc would be cast for the next 5 years even if we
> did fix it in a month). Thoughts?

I don't see anything particularly ugly with extending either
action-specific attributes or TCA_ACT_* attributes, so it's hard for me
to reason about problems with the current approach :)

>
> cheers,
> jamal
Vlad Buslov Oct. 25, 2019, 4:10 p.m. UTC | #34
On Fri 25 Oct 2019 at 19:06, Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> On 2019-10-25 11:43 a.m., Jamal Hadi Salim wrote:
>> On 2019-10-25 11:18 a.m., Vlad Buslov wrote:
>>>
>>
>>> The problem with this approach is that it only works when actions are
>>> created through act API, and not when they are created together with
>>> filter by cls API which doesn't expect or parse TCA_ROOT. That is why I
>>> wanted to have something in tcf_action_init_1() which is called by both
>>> of them.
>>>
>>
>> Aha. So the call path for tcf_action_init_1() via cls_api also needs
>> to have this infra. I think i understand better what you wanted
>> to do earlier with changing those enums.
>>
>
> Hold on. Looking more at the code, direct call for tcf_action_init_1()
> from the cls code path is for backward compat of old policer approach.
> I think even modern iproute2 doesnt support that kind of call
> anymore. So you can pass NULL there for the *flags.

But having the FAST_INIT flag set when creating actions through the cls
API is my main use case. Are you suggesting to only have flags when
actions are created through the act API?

>
> But: for direct call to tcf_action_init() we would have
> to extract the flag from the TLV.
> The TLV already has TCA_ROOT_FLAGS in it.
>
>
> cheers,
> jamal
Vlad Buslov Oct. 25, 2019, 4:27 p.m. UTC | #35
On Fri 25 Oct 2019 at 19:06, Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> On 2019-10-25 11:43 a.m., Jamal Hadi Salim wrote:
>> On 2019-10-25 11:18 a.m., Vlad Buslov wrote:
>>>
>>
>>> The problem with this approach is that it only works when actions are
>>> created through act API, and not when they are created together with
>>> filter by cls API which doesn't expect or parse TCA_ROOT. That is why I
>>> wanted to have something in tcf_action_init_1() which is called by both
>>> of them.
>>>
>>
>> Aha. So the call path for tcf_action_init_1() via cls_api also needs
>> to have this infra. I think i understand better what you wanted
>> to do earlier with changing those enums.
>>
>
> Hold on. Looking more at the code, direct call for tcf_action_init_1()
> from the cls code path is for backward compat of old policer approach.
> I think even modern iproute2 doesnt support that kind of call
> anymore. So you can pass NULL there for the *flags.
>
> But: for direct call to tcf_action_init() we would have
> to extract the flag from the TLV.
> The TLV already has TCA_ROOT_FLAGS in it.

After re-reading the code, I think what you are suggesting is that I can
somehow parse TCA_ROOT_FLAGS in the following function:

int tcf_action_init(struct net *net, struct tcf_proto *tp, struct nlattr *nla,
		    struct nlattr *est, char *name, int ovr, int bind,
		    struct tc_action *actions[], size_t *attr_size,
		    bool rtnl_held, struct netlink_ext_ack *extack)
{
	struct nlattr *tb[TCA_ACT_MAX_PRIO + 1];
	struct tc_action *act;
	size_t sz = 0;
	int err;
	int i;

->	err = nla_parse_nested_deprecated(tb, TCA_ACT_MAX_PRIO, nla, NULL,
					  extack);
	if (err < 0)
		return err;

	for (i = 1; i <= TCA_ACT_MAX_PRIO && tb[i]; i++) {
		act = tcf_action_init_1(net, tp, tb[i], est, name, ovr, bind,
					rtnl_held, extack);
		if (IS_ERR(act)) {
			err = PTR_ERR(act);
			goto err;
		}
		act->order = i;
		sz += tcf_action_fill_size(act);
		/* Start from index 0 */
		actions[i - 1] = act;
	}

	*attr_size = tcf_action_full_attrs_size(sz);
	return i - 1;

err:
	tcf_action_destroy(actions, bind);
	return err;
}

Again, I'm not too familiar with netlink, but looking at that I assume
that the code parses up to TCA_ACT_MAX_PRIO nested TCA_ACT attributes
and passes them one by one to tcf_action_init_1(). Are you suggesting
that it also somehow includes TCA_ROOT?

Regards,
Vlad
Jamal Hadi Salim Oct. 25, 2019, 4:29 p.m. UTC | #36
On 2019-10-25 12:10 p.m., Vlad Buslov wrote:
> 

>> Hold on. Looking more at the code, direct call for tcf_action_init_1()
>> from the cls code path is for backward compat of old policer approach.
>> I think even modern iproute2 doesnt support that kind of call
>> anymore. So you can pass NULL there for the *flags.
> 
> But having the FAST_INIT flag set when creating actions through cls API
> is my main use case. Are you suggesting to only have flags when actions
> created through act API?
>

Not at all. Here's my thinking...

I didn't see your iproute2 change; however, in user space - I think all
the classifiers eventually call parse_action()? You can stick the flags
there under TCA_ACT_ROOT_FLAGS.

In the kernel, tcf_exts_validate() - two spots:
tcf_action_init_1() pass NULL
tcf_action_init() before invocation extract the TCA_ACT_ROOT_FLAGS
(similar to the act_api approach).

Am I missing something? Sorry - I don't have many cycles right now
but I could do a prototype later.

cheers,
jamal
Vlad Buslov Oct. 25, 2019, 4:53 p.m. UTC | #37
On Fri 25 Oct 2019 at 19:29, Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> On 2019-10-25 12:10 p.m., Vlad Buslov wrote:
>>
>
>>> Hold on. Looking more at the code, direct call for tcf_action_init_1()
>>> from the cls code path is for backward compat of old policer approach.
>>> I think even modern iproute2 doesnt support that kind of call
>>> anymore. So you can pass NULL there for the *flags.
>>
>> But having the FAST_INIT flag set when creating actions through cls API
>> is my main use case. Are you suggesting to only have flags when actions
>> created through act API?
>>
>
> Not at all. Here's my thinking...
>
> I didnt see your iproute2 change; however, in user space - i think all
> the classifiers eventually call parse_action()? You can stick the flags
> there under TCA_ACT_ROOT_FLAGS
>
> In the kernel, tcf_exts_validate() - two spots:
> tcf_action_init_1() pass NULL
> tcf_action_init() before invocation extract the TCA_ACT_ROOT_FLAGS
> (similar to the act_api approach).
>
> Am i missing something? Sorry - dont have much cycles right now
> but i could do a prototype later.
>
> cheers,
> jamal

I don't exactly follow. In case of the act API we have the following
call chain: tc_ctl_action()->tcf_action_add()->tcf_action_init(). In
this case TCA_ROOT is parsed in tc_ctl_action() and tcf_action_init()
is called with one of its nested attributes - tca[TCA_ACT_TAB]. When
tcf_action_init() is called TCA_ROOT is already "parsed out", but it is
easy to pass it as an argument.

For the cls API let's take flower as an example: fl_change() parses
TCA_FLOWER and calls fl_set_parms()->tcf_exts_validate()->tcf_action_init()
with the TCA_FLOWER_ACT nested attribute. No TCA_ROOT is expected;
TCA_FLOWER_ACT contains up to TCA_ACT_MAX_PRIO nested TCA_ACT attributes.
So where can I include it without breaking backward compatibility?

Regards,
Vlad
Jamal Hadi Salim Oct. 25, 2019, 6:17 p.m. UTC | #38
On 2019-10-25 12:53 p.m., Vlad Buslov wrote:
[..]

Sorry, the distractions here are not helping.
When I responded I had started to do the below:

enum {
         TCA_ACT_UNSPEC,
         TCA_ACT_KIND,
         TCA_ACT_OPTIONS,
         TCA_ACT_INDEX,
         TCA_ACT_STATS,
         TCA_ACT_PAD,
         TCA_ACT_COOKIE,
         TCA_ACT_ROOT_FLAGS,
         __TCA_ACT_MAX
};

Note: "TCA_ACT_ROOT_FLAGS"
I think your other email may have tried to do the same?
So I claimed it was there in that email because grep showed it,
but that's because I had added it ;->

> For cls API lets take flower as an example: fl_change() parses TCA_FLOWER, and calls
> fl_set_parms()->tcf_exts_validate()->tcf_action_init() with
> TCA_FLOWER_ACT nested attribute. No TCA_ROOT is expected, TCA_FLOWER_ACT
> contains up to TCA_ACT_MAX_PRIO nested TCA_ACT attributes. So where can
> I include it without breaking backward compatibility?
> 
Not a clean solution, but it avoids the per-action TLV attribute:

In user space parse_action(), you add TCA_ACT_ROOT_FLAGS
if the user specifies this on the command line.

In the kernel everything just passes NULL for root_flags
when the invocation is from tcf_exts_validate(), i.e. both
tcf_action_init_1() and
tcf_action_init().

In tcf_action_init_1(), if root_flags is NULL
you try to get the flags from tb[TCA_ACT_ROOT_FLAGS].

Apologies in advance for any latency - I will have time tomorrow.

cheers,
jamal
Jamal Hadi Salim Oct. 25, 2019, 9:05 p.m. UTC | #39
Like attached...

cheers,
jamal

On 2019-10-25 2:17 p.m., Jamal Hadi Salim wrote:
> On 2019-10-25 12:53 p.m., Vlad Buslov wrote:
> [..]
> 
> Sorry, the distractions here are not helping.
> When i responded i had started to do below
> 
> enum {
>          TCA_ACT_UNSPEC,
>          TCA_ACT_KIND,
>          TCA_ACT_OPTIONS,
>          TCA_ACT_INDEX,
>          TCA_ACT_STATS,
>          TCA_ACT_PAD,
>          TCA_ACT_COOKIE,
>          TCA_ACT_ROOT_FLAGS,
>          __TCA_ACT_MAX
> };
> 
> Note: "TCA_ACT_ROOT_FLAGS"
> I think your other email may have tried to do the same?
> So i claimed it was there in that email because grep showed it
> but that's because i had added it;->
> 
>> For cls API lets take flower as an example: fl_change() parses 
>> TCA_FLOWER, and calls
>> fl_set_parms()->tcf_exts_validate()->tcf_action_init() with
>> TCA_FLOWER_ACT nested attribute. No TCA_ROOT is expected, TCA_FLOWER_ACT
>> contains up to TCA_ACT_MAX_PRIO nested TCA_ACT attributes. So where can
>> I include it without breaking backward compatibility?
>>
> Not a clean solution but avoids the per action TLV attribute:
> 
> in user space parse_action() you add TCA_ACT_ROOT_FLAGS
> if the user specifies this on the command line.
> 
> in the kernel everything just passes NULL for root_flags
> when invocation is from tcf_exts_validate() i.e both
> tcf_action_init_1() and
> tcf_action_init()
> 
> In tcf_action_init_1(), if root_flags is NULL
> you try to get the flags from tb[TCA_ACT_ROOT_FLAGS]
> 
> Apologies in advance for any latency - I will have time tomorrow.
> 
> cheers,
> jamal
diff --git a/include/net/act_api.h b/include/net/act_api.h
index b18c699681ca..ec52b28ddfc5 100644
--- a/include/net/act_api.h
+++ b/include/net/act_api.h
@@ -93,7 +93,8 @@ struct tc_action_ops {
 	int     (*lookup)(struct net *net, struct tc_action **a, u32 index);
 	int     (*init)(struct net *net, struct nlattr *nla,
 			struct nlattr *est, struct tc_action **act, int ovr,
-			int bind, bool rtnl_held, struct tcf_proto *tp,
+			int bind, bool rtnl_held, struct nla_bitfield32 *rf,
+			struct tcf_proto *tp,
 			struct netlink_ext_ack *extack);
 	int     (*walk)(struct net *, struct sk_buff *,
 			struct netlink_callback *, int,
@@ -176,10 +177,12 @@ int tcf_action_exec(struct sk_buff *skb, struct tc_action **actions,
 int tcf_action_init(struct net *net, struct tcf_proto *tp, struct nlattr *nla,
 		    struct nlattr *est, char *name, int ovr, int bind,
 		    struct tc_action *actions[], size_t *attr_size,
-		    bool rtnl_held, struct netlink_ext_ack *extack);
+		    bool rtnl_held, struct nla_bitfield32 *root_flags,
+		    struct netlink_ext_ack *extack);
 struct tc_action *tcf_action_init_1(struct net *net, struct tcf_proto *tp,
 				    struct nlattr *nla, struct nlattr *est,
 				    char *name, int ovr, int bind,
+				    struct nla_bitfield32 *root_flags,
 				    bool rtnl_held,
 				    struct netlink_ext_ack *extack);
 int tcf_action_dump(struct sk_buff *skb, struct tc_action *actions[], int bind,
diff --git a/include/uapi/linux/pkt_cls.h b/include/uapi/linux/pkt_cls.h
index a6aa466fac9e..d92f3d0f2c79 100644
--- a/include/uapi/linux/pkt_cls.h
+++ b/include/uapi/linux/pkt_cls.h
@@ -16,6 +16,7 @@ enum {
 	TCA_ACT_STATS,
 	TCA_ACT_PAD,
 	TCA_ACT_COOKIE,
+	TCA_ACT_ROOT_FLAGS,
 	__TCA_ACT_MAX
 };
 
diff --git a/net/sched/act_api.c b/net/sched/act_api.c
index 69d4676a402f..ef7b0bb735c7 100644
--- a/net/sched/act_api.c
+++ b/net/sched/act_api.c
@@ -842,6 +842,7 @@ static const struct nla_policy tcf_action_policy[TCA_ACT_MAX + 1] = {
 struct tc_action *tcf_action_init_1(struct net *net, struct tcf_proto *tp,
 				    struct nlattr *nla, struct nlattr *est,
 				    char *name, int ovr, int bind,
+				    struct nla_bitfield32 *root_flags,
 				    bool rtnl_held,
 				    struct netlink_ext_ack *extack)
 {
@@ -852,6 +853,7 @@ struct tc_action *tcf_action_init_1(struct net *net, struct tcf_proto *tp,
 	struct nlattr *tb[TCA_ACT_MAX + 1];
 	struct nlattr *kind;
 	int err;
+	struct nla_bitfield32 rf = {0, 0};
 
 	if (name == NULL) {
 		err = nla_parse_nested_deprecated(tb, TCA_ACT_MAX, nla,
@@ -884,6 +886,10 @@ struct tc_action *tcf_action_init_1(struct net *net, struct tcf_proto *tp,
 		}
 	}
 
+	if (!root_flags && tb[TCA_ACT_ROOT_FLAGS]) {
+		rf = nla_get_bitfield32(tb[TCA_ACT_ROOT_FLAGS]);
+		root_flags = &rf;
+	}
 	a_o = tc_lookup_action_n(act_name);
 	if (a_o == NULL) {
 #ifdef CONFIG_MODULES
@@ -914,10 +920,10 @@ struct tc_action *tcf_action_init_1(struct net *net, struct tcf_proto *tp,
 	/* backward compatibility for policer */
 	if (name == NULL)
 		err = a_o->init(net, tb[TCA_ACT_OPTIONS], est, &a, ovr, bind,
-				rtnl_held, tp, extack);
+				rtnl_held, root_flags, tp, extack);
 	else
 		err = a_o->init(net, nla, est, &a, ovr, bind, rtnl_held,
-				tp, extack);
+				root_flags, tp, extack);
 	if (err < 0)
 		goto err_mod;
 
@@ -955,7 +961,8 @@ struct tc_action *tcf_action_init_1(struct net *net, struct tcf_proto *tp,
 int tcf_action_init(struct net *net, struct tcf_proto *tp, struct nlattr *nla,
 		    struct nlattr *est, char *name, int ovr, int bind,
 		    struct tc_action *actions[], size_t *attr_size,
-		    bool rtnl_held, struct netlink_ext_ack *extack)
+		    bool rtnl_held, struct nla_bitfield32 *root_flags,
+		    struct netlink_ext_ack *extack)
 {
 	struct nlattr *tb[TCA_ACT_MAX_PRIO + 1];
 	struct tc_action *act;
@@ -970,7 +977,7 @@ int tcf_action_init(struct net *net, struct tcf_proto *tp, struct nlattr *nla,
 
 	for (i = 1; i <= TCA_ACT_MAX_PRIO && tb[i]; i++) {
 		act = tcf_action_init_1(net, tp, tb[i], est, name, ovr, bind,
-					rtnl_held, extack);
+					root_flags, rtnl_held, extack);
 		if (IS_ERR(act)) {
 			err = PTR_ERR(act);
 			goto err;
@@ -1350,6 +1357,7 @@ tcf_add_notify(struct net *net, struct nlmsghdr *n, struct tc_action *actions[],
 
 static int tcf_action_add(struct net *net, struct nlattr *nla,
 			  struct nlmsghdr *n, u32 portid, int ovr,
+			  struct nla_bitfield32 *root_flags,
 			  struct netlink_ext_ack *extack)
 {
 	size_t attr_size = 0;
@@ -1358,7 +1366,8 @@ static int tcf_action_add(struct net *net, struct nlattr *nla,
 
 	for (loop = 0; loop < 10; loop++) {
 		ret = tcf_action_init(net, NULL, nla, NULL, NULL, ovr, 0,
-				      actions, &attr_size, true, extack);
+				      actions, &attr_size, true, root_flags,
+				      extack);
 		if (ret != -EAGAIN)
 			break;
 	}
@@ -1385,6 +1394,7 @@ static int tc_ctl_action(struct sk_buff *skb, struct nlmsghdr *n,
 	struct net *net = sock_net(skb->sk);
 	struct nlattr *tca[TCA_ROOT_MAX + 1];
 	u32 portid = skb ? NETLINK_CB(skb).portid : 0;
+	struct nla_bitfield32 root_flags = {0, 0};
 	int ret = 0, ovr = 0;
 
 	if ((n->nlmsg_type != RTM_GETACTION) &&
@@ -1412,8 +1422,11 @@ static int tc_ctl_action(struct sk_buff *skb, struct nlmsghdr *n,
 		 */
 		if (n->nlmsg_flags & NLM_F_REPLACE)
 			ovr = 1;
+		if (tb[TCA_ROOT_FLAGS])
+			root_flags = nla_get_bitfield32(tb[TCA_ROOT_FLAGS]);
+
 		ret = tcf_action_add(net, tca[TCA_ACT_TAB], n, portid, ovr,
-				     extack);
+				     &root_flags, extack);
 		break;
 	case RTM_DELACTION:
 		ret = tca_action_gd(net, tca[TCA_ACT_TAB], n,
diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index 8717c0b26c90..9dc5bf0d637e 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -2946,7 +2946,7 @@ int tcf_exts_validate(struct net *net, struct tcf_proto *tp, struct nlattr **tb,
 			act = tcf_action_init_1(net, tp, tb[exts->police],
 						rate_tlv, "police", ovr,
 						TCA_ACT_BIND, rtnl_held,
-						extack);
+						NULL, extack);
 			if (IS_ERR(act))
 				return PTR_ERR(act);
 
@@ -2959,7 +2959,7 @@ int tcf_exts_validate(struct net *net, struct tcf_proto *tp, struct nlattr **tb,
 			err = tcf_action_init(net, tp, tb[exts->action],
 					      rate_tlv, NULL, ovr, TCA_ACT_BIND,
 					      exts->actions, &attr_size,
-					      rtnl_held, extack);
+					      rtnl_held, NULL, extack);
 			if (err < 0)
 				return err;
 			exts->nr_actions = err;
Jamal Hadi Salim Oct. 25, 2019, 9:10 p.m. UTC | #40
On 2019-10-25 5:05 p.m., Jamal Hadi Salim wrote:
> +	if (!root_flags && tb[TCA_ACT_ROOT_FLAGS]) {
> +		rf = nla_get_bitfield32(tb[TCA_ACT_ROOT_FLAGS]);
> +		root_flags = &rf;
> +	}



The !root_flags check doesn't look right.
Hopefully it makes more sense now...

cheers,
jamal
Jamal Hadi Salim Oct. 25, 2019, 9:52 p.m. UTC | #41
On 2019-10-25 5:10 p.m., Jamal Hadi Salim wrote:
> On 2019-10-25 5:05 p.m., Jamal Hadi Salim wrote:
>> +    if (!root_flags && tb[TCA_ACT_ROOT_FLAGS]) {
>> +        rf = nla_get_bitfield32(tb[TCA_ACT_ROOT_FLAGS]);
>> +        root_flags = &rf;
>> +    }
> 
> 
> 
> !root_flags check doesnt look right.
> Hopefully it makes more sense now....
> 

For completion:
It compiled ;->


cheers,
jamal
Vlad Buslov Oct. 26, 2019, 9:44 a.m. UTC | #42
On Sat 26 Oct 2019 at 00:52, Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> On 2019-10-25 5:10 p.m., Jamal Hadi Salim wrote:
>> On 2019-10-25 5:05 p.m., Jamal Hadi Salim wrote:
>>> +    if (!root_flags && tb[TCA_ACT_ROOT_FLAGS]) {
>>> +        rf = nla_get_bitfield32(tb[TCA_ACT_ROOT_FLAGS]);
>>> +        root_flags = &rf;
>>> +    }
>>
>>
>>
>> !root_flags check doesnt look right.
>> Hopefully it makes more sense now....
>>
>
> For completion:
> It compiled ;->
>
>
> cheers,
> jamal

Okay, I understand now what you are suggesting. But why not unify the
cls and act APIs, and always have the flags parsed in tcf_action_init_1()
as TCA_ACT_ROOT_FLAGS, like I suggested in one of my previous mails?
That way we don't have to pass pointers around.

Regards,
Vlad
Jamal Hadi Salim Oct. 26, 2019, 12:26 p.m. UTC | #43
On 2019-10-26 5:44 a.m., Vlad Buslov wrote:
> 

> Okay, I understand now what you suggest. But why not unify cls and act
> API, and always have flags parsed in tcf_action_init_1() as
> TCA_ACT_ROOT_FLAGS like I suggested in one of my previous mails? That
> way we don't have to pass pointers around.

That would work.
I am being a sucker for optimization - one flag for a batch
of actions vs one per action.
It is a good compromise, go for it.

cheers,
jamal
Roman Mashak Oct. 26, 2019, 2:52 p.m. UTC | #44
Jamal Hadi Salim <jhs@mojatatu.com> writes:

> On 2019-10-26 5:44 a.m., Vlad Buslov wrote:
>>
>
>> Okay, I understand now what you suggest. But why not unify cls and act
>> API, and always have flags parsed in tcf_action_init_1() as
>> TCA_ACT_ROOT_FLAGS like I suggested in one of my previous mails? That
>> way we don't have to pass pointers around.
>
> That would work.
> I am being a sucker for optimization - one flag for a batch
> of actions vs one per action.
> It is a good compromise, go for it.

But why do we need to have two attributes, one at the root level
TCA_ROOT_FLAGS and the other at the inner TCA_ACT_* level, but in fact
serving the same purpose -- passing flags for optimizations?

The whole nest of action attributes, including the root ones, is passed
as the 3rd argument of tcf_exts_validate(), so it can be validated and
extracted at that level and passed to tcf_action_init_1() as a pointer
to a 32-bit flag; admittedly it's ugly given the growing number of
arguments to tcf_action_init_1(). With old iproute2 the pointer will
always be NULL, so I think backward compatibility will be preserved.
Jamal Hadi Salim Oct. 26, 2019, 4:06 p.m. UTC | #45
On 2019-10-26 10:52 a.m., Roman Mashak wrote:
[..]
> 
> But why do we need to have two attributes, one at the root level
> TCA_ROOT_FLAGS and the other at the inner TCA_ACT_* level, but in fact
> serving the same purpose -- passing flags for optimizations?
> 
 >
> The whole nest of action attributes including root ones is passed as 3rd
> argument of tcf_exts_validate(), so it can be validated and extracted at
> that level and passed to tcf_action_init_1() as pointer to 32-bit flag,
> admittedly it's ugly given the growing number of arguments to
> tcf_action_init_1(). With old iproute2 the pointer will always be NULL,
> so I think backward compatibility will be preserved.

Note: we only call tcf_action_init_1() at that level for the very
old policer API, for backward compatibility reasons. I think what
would make sense is to be able to call tcf_action_init() (the else
branch in tcf_exts_validate()) from that level with a global flag,
but for that we would need to introduce something like TCA_ROOT_FLAGS
under this space:
---
enum {
         TCA_UNSPEC,
         TCA_KIND,
         TCA_OPTIONS,
         TCA_STATS,
         TCA_XSTATS,
         TCA_RATE,
         TCA_FCNT,
         TCA_STATS2,
         TCA_STAB,
         TCA_PAD,
         TCA_DUMP_INVISIBLE,
         TCA_CHAIN,
         TCA_HW_OFFLOAD,
         TCA_INGRESS_BLOCK,
         TCA_EGRESS_BLOCK,
         __TCA_MAX
};
---

which would be a cleaner solution but would require
_a lot more code_ in both user space and the kernel.
That's why I feel Vlad's suggestion is a reasonable compromise:
it gets rid of the original issue of per-specific-action
TLVs.

On optimization:
The current suggestion from Vlad is a bit inefficient.
For example, if I was trying to batch 100 actions I now have 1200
bytes of overhead instead of 12 bytes.

cheers,
jamal
Vlad Buslov Oct. 26, 2019, 4:42 p.m. UTC | #46
On Sat 26 Oct 2019 at 19:06, Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> On 2019-10-26 10:52 a.m., Roman Mashak wrote:
> [..]
>>
>> But why do we need to have two attributes, one at the root level
>> TCA_ROOT_FLAGS and the other at the inner TCA_ACT_* level, but in fact
>> serving the same purpose -- passing flags for optimizations?
>>
>>
>> The whole nest of action attributes, including root ones, is passed as the
>> 3rd argument of tcf_exts_validate(), so it can be validated and extracted
>> at that level and passed to tcf_action_init_1() as a pointer to a 32-bit
>> flags value; admittedly it's ugly given the growing number of arguments to
>> tcf_action_init_1(). With old iproute2 the pointer will always be NULL,
>> so I think backward compatibility will be preserved.
>
> Note: we only call tcf_action_init_1() at that level for the very
> old policer API, for backward compatibility reasons. I think what
> would make sense is to be able to call tcf_action_init() (the else
> statement in tcf_exts_validate()) from that level with a global flag,
> but for that we would need to introduce something like TCA_ROOT_FLAGS
> under this space:
> ---
> enum {
>         TCA_UNSPEC,
>         TCA_KIND,
>         TCA_OPTIONS,
>         TCA_STATS,
>         TCA_XSTATS,
>         TCA_RATE,
>         TCA_FCNT,
>         TCA_STATS2,
>         TCA_STAB,
>         TCA_PAD,
>         TCA_DUMP_INVISIBLE,
>         TCA_CHAIN,
>         TCA_HW_OFFLOAD,
>         TCA_INGRESS_BLOCK,
>         TCA_EGRESS_BLOCK,
>         __TCA_MAX
> };
> ---
>
> which would be a cleaner solution but would require
> _a lot more code_ in both user space and the kernel.
> That's why I feel Vlad's suggestion is a reasonable compromise:
> it gets rid of the original issue of per-specific-action
> TLVs.
>
> On optimization:
> The current suggestion from Vlad is a bit inefficient.
> For example, if I was trying to batch 100 actions I now have 1200
> bytes of overhead instead of 12 bytes.
>
> cheers,
> jamal

Hmm, yes, this looks quite redundant. At the same time, having basically
the same flags in two different spaces is ugly. Maybe the correct approach
would be not to add the flag to the act API at all and only extend
tcf_action_init_1() to be used from the cls API? I don't see a use for the
act API flag at the moment because the only flag value (skip percpu
allocation) is hardware-offload specific and such clients generally create
actions through the cls new-filter API. WDYT?

Regards,
Vlad
Jamal Hadi Salim Oct. 26, 2019, 6:38 p.m. UTC | #47
On 2019-10-26 12:42 p.m., Vlad Buslov wrote:
> 
> On Sat 26 Oct 2019 at 19:06, Jamal Hadi Salim <jhs@mojatatu.com> wrote:

> Hmm, yes, this looks quite redundant.

It's legit.
Two different code paths for two different objects
that can be configured independently or together.
Only other thing that does this is FIB/NH path.

> At the same time, having basically
> the same flags in two different spaces is ugly. Maybe the correct approach
> would be not to add the flag to the act API at all and only extend
> tcf_action_init_1() to be used from the cls API? I don't see a use for the
> act API flag at the moment because the only flag value (skip percpu
> allocation) is hardware-offload specific and such clients generally create
> actions through the cls new-filter API. WDYT?
> 

I am not sure if there is a gain. Your code path is
tcf_exts_validate()->tcf_action_init()->tcf_action_init_1()

Do you want to replicate the tcf_action_init() in
cls?

cheers,
jamal
Vlad Buslov Oct. 27, 2019, 8:31 a.m. UTC | #48
On Sat 26 Oct 2019 at 21:38, Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> On 2019-10-26 12:42 p.m., Vlad Buslov wrote:
>>
>> On Sat 26 Oct 2019 at 19:06, Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>
>> Hmm, yes, this looks quite redundant.
>
> It's legit.
> Two different code paths for two different objects
> that can be configured independently or together.
> Only other thing that does this is FIB/NH path.
>
>> At the same time, having basically
>> the same flags in two different spaces is ugly. Maybe the correct approach
>> would be not to add the flag to the act API at all and only extend
>> tcf_action_init_1() to be used from the cls API? I don't see a use for the
>> act API flag at the moment because the only flag value (skip percpu
>> allocation) is hardware-offload specific and such clients generally create
>> actions through the cls new-filter API. WDYT?
>>
>
> I am not sure if there is a gain. Your code path is
> tcf_exts_validate()->tcf_action_init()->tcf_action_init_1()
>
> Do you want to replicate the tcf_action_init() in
> cls?
>
> cheers,
> jamal

No, I meant that we don't need to change iproute2 to always send the new
TCA_ACT_ROOT_FLAGS attribute. It can be set conditionally (only when the
user sets the flag) or only when creating filters with actions attached
(cls API). Such an approach wouldn't introduce any additional overhead to
netlink packets when batch-creating actions.

Regards,
Vlad
Jamal Hadi Salim Oct. 27, 2019, 9:13 a.m. UTC | #49
On 2019-10-27 4:31 a.m., Vlad Buslov wrote:

> 
> No, I meant that we don't need to change iproute2 to always send the new
> TCA_ACT_ROOT_FLAGS attribute. It can be set conditionally (only when the
> user sets the flag) or only when creating filters with actions attached
> (cls API). Such an approach wouldn't introduce any additional overhead to
> netlink packets when batch-creating actions.
> 
iproute2 will only send it conditionally.
parse_actions() in iproute2 will only set this field if the user
explicitly set it on the command line for that action; whether the
action is specified as standalone (or in a batch) or bound to a
filter should make no difference. No?

The overhead I was talking about was the worst-case scenario.

cheers,
jamal