[net-next,v6,2/3] net sched actions: dump more than TCA_ACT_MAX_PRIO actions per batch

Message ID 1492772132-16559-3-git-send-email-jhs@emojatatu.com
State Changes Requested
Delegated to: David Miller

Commit Message

Jamal Hadi Salim April 21, 2017, 10:55 a.m.
From: Jamal Hadi Salim <jhs@mojatatu.com>

When you dump hundreds of thousands of actions, getting only 32 per
dump batch, even when the socket buffer and memory allocations allow
more, is inefficient.

With this change, the user gets as many actions per batch as fit
within the constraints available to the kernel.

The top level action TLV space is extended. An attribute
TCA_ROOT_FLAGS is used to carry flags; flag TCA_FLAG_LARGE_DUMP_ON
is set by the user to indicate that the user is capable of processing
these large dumps. Older user space which doesn't set this flag
doesn't get the larger (than 32) batches.
The kernel uses the TCA_ROOT_COUNT attribute to tell the user how many
actions are put in a single batch. As such, a user space app knows how
long to iterate (independent of the type of action being dumped)
instead of relying on a hardcoded maximum of 32.
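
As an illustration of the description above, here is a minimal user space
sketch of reading TCA_ROOT_COUNT from the top-level attributes of a dump
response. It is not taken from tc or from this patch: struct attr_hdr,
put_u32_attr() and find_root_count() are hypothetical names, the header
layout is assumed to match the usual 4-byte netlink attribute header (u16
length, u16 type, 4-byte alignment), and only the enum values and the flag
bit come from the diff.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Values as defined by the enum and #define in this patch. */
enum { TCA_ROOT_UNSPEC, TCA_ROOT_TAB, TCA_ROOT_FLAGS, TCA_ROOT_COUNT };
#define TCA_FLAG_LARGE_DUMP_ON (1U << 0)

/* Hypothetical stand-in for the 4-byte netlink attribute header. */
struct attr_hdr {
	uint16_t len;	/* header + payload, excluding padding */
	uint16_t type;
};

#define ATTR_ALIGN(n)	(((n) + 3U) & ~3U)
#define ATTR_HDRLEN	((size_t)sizeof(struct attr_hdr))

/* Append one u32 attribute at offset off (host byte order).
 * Returns the new offset, or 0 if the attribute does not fit.
 */
static size_t put_u32_attr(unsigned char *buf, size_t cap, size_t off,
			   uint16_t type, uint32_t value)
{
	struct attr_hdr h = { (uint16_t)(ATTR_HDRLEN + sizeof(value)), type };

	if (off + ATTR_ALIGN(h.len) > cap)
		return 0;
	memcpy(buf + off, &h, sizeof(h));
	memcpy(buf + off + ATTR_HDRLEN, &value, sizeof(value));
	return off + ATTR_ALIGN(h.len);
}

/* Walk the top-level attributes and fetch TCA_ROOT_COUNT.
 * Returns 1 and stores the count if present, 0 otherwise.
 */
static int find_root_count(const unsigned char *buf, size_t len,
			   uint32_t *count)
{
	size_t off = 0;

	while (off + ATTR_HDRLEN <= len) {
		struct attr_hdr h;

		memcpy(&h, buf + off, sizeof(h));
		if (h.len < ATTR_HDRLEN || h.len > len - off)
			break;	/* malformed attribute, stop parsing */
		if (h.type == TCA_ROOT_COUNT &&
		    h.len >= ATTR_HDRLEN + sizeof(*count)) {
			memcpy(count, buf + off + ATTR_HDRLEN, sizeof(*count));
			return 1;
		}
		off += ATTR_ALIGN(h.len);
	}
	return 0;
}
```

A requester would place a TCA_ROOT_FLAGS attribute carrying
TCA_FLAG_LARGE_DUMP_ON after the tcamsg header, then use the returned count
as the loop bound when walking the action table of each response.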

Some results dumping 1.5M actions, first with an unpatched tc, which the
kernel doesn't help:

prompt$ time -p tc actions ls action gact | grep index | wc -l
1500000
real 1388.43
user 2.07
sys 1386.79

Now let's see a patched tc which sets the correct flags when requesting
a dump:

prompt$ time -p updatedtc actions ls action gact | grep index | wc -l
1500000
real 178.13
user 2.02
sys 176.96

That is about an 8x performance improvement for a tc which sets its
receive buffer to about 32K.

Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
---
 include/uapi/linux/rtnetlink.h | 21 +++++++++++++++++--
 net/sched/act_api.c            | 46 +++++++++++++++++++++++++++++++++---------
 2 files changed, 55 insertions(+), 12 deletions(-)

Comments

Jiri Pirko April 21, 2017, 1:12 p.m. | #1
Fri, Apr 21, 2017 at 12:55:31PM CEST, jhs@mojatatu.com wrote:
>diff --git a/include/uapi/linux/rtnetlink.h b/include/uapi/linux/rtnetlink.h
>index cce0613..09e7b22d 100644
>--- a/include/uapi/linux/rtnetlink.h
>+++ b/include/uapi/linux/rtnetlink.h
>@@ -674,10 +674,27 @@ struct tcamsg {
> 	unsigned char	tca__pad1;
> 	unsigned short	tca__pad2;
> };
>+
>+enum {
>+	TCA_ROOT_UNSPEC,
>+	TCA_ROOT_TAB,
>+#define TCA_ACT_TAB TCA_ROOT_TAB
>+	TCA_ROOT_FLAGS,
>+	TCA_ROOT_COUNT,
>+	__TCA_ROOT_MAX,
>+#define	TCA_ROOT_MAX (__TCA_ROOT_MAX - 1)
>+};
>+
> #define TA_RTA(r)  ((struct rtattr*)(((char*)(r)) + NLMSG_ALIGN(sizeof(struct tcamsg))))
> #define TA_PAYLOAD(n) NLMSG_PAYLOAD(n,sizeof(struct tcamsg))
>-#define TCA_ACT_TAB 1 /* attr type must be >=1 */	
>-#define TCAA_MAX 1
>+/* tcamsg flags stored in attribute TCA_ROOT_FLAGS
>+ *
>+ * TCA_FLAG_LARGE_DUMP_ON user->kernel to request for larger than TCA_ACT_MAX_PRIO
>+ * actions in a dump. All dump responses will contain the number of actions
>+ * being dumped stored in for user app's consumption in TCA_ROOT_COUNT
>+ *
>+ */
>+#define TCA_FLAG_LARGE_DUMP_ON		(1 << 0)

This is a u32 "flags" field that cannot be extended for other uses in
the future. I'm missing the point. Also, you don't check the rest of the
bits for 0, as requested by DaveM.

As long as this is unextendable, please make this a u8 with values 0 and 1
as I originally suggested.

I don't understand why we are running in circles about this...
Jamal Hadi Salim April 21, 2017, 3:12 p.m. | #2
On 17-04-21 09:12 AM, Jiri Pirko wrote:
> Fri, Apr 21, 2017 at 12:55:31PM CEST, jhs@mojatatu.com wrote:
>> From: Jamal Hadi Salim <jhs@mojatatu.com>

>> +#define TCA_FLAG_LARGE_DUMP_ON		(1 << 0)
>
> This is a u32 "flags" field that cannot be extended for other uses in
> the future. I'm missing the point. Also, you don't check the rest of the
> bits for 0, as requested by DaveM.
>
> As long as this is unextendable, please make this a u8 with values 0 and 1
> as I originally suggested.
>
> I don't understand why we are running in circles about this...
>

Say I have a 32 bit space of which I am using one bit.
The sender (user space) zeroes the bits except the ones it is
interested in. The kernel checks the bits it is interested in.
In the future we add one more bit and the same philosophy applies.
Older kernels don't see this bit, but they don't have the feature
to begin with. So where is the lack of extensibility?

Jiri, there is a balance between extensibility and performance.
It is senseless to use a TLV just so I can set a 0/1 (true/false).

cheers,
jamal
Eric Dumazet April 21, 2017, 3:20 p.m. | #3
On Fri, 2017-04-21 at 11:12 -0400, Jamal Hadi Salim wrote:
> On 17-04-21 09:12 AM, Jiri Pirko wrote:
> > Fri, Apr 21, 2017 at 12:55:31PM CEST, jhs@mojatatu.com wrote:
> >> From: Jamal Hadi Salim <jhs@mojatatu.com>
> 
> >> +#define TCA_FLAG_LARGE_DUMP_ON		(1 << 0)
> >
> > This is a u32 "flags" field that cannot be extended for other uses in
> > the future. I'm missing the point. Also, you don't check the rest of the
> > bits for 0, as requested by DaveM.
> >
> > As long as this is unextendable, please make this a u8 with values 0 and 1
> > as I originally suggested.
> >
> > I don't understand why we are running in circles about this...
> >
> 
> Say I have a 32 bit space of which I am using one bit.
> The sender (user space) zeroes the bits except the ones it is
> interested in. The kernel checks the bits it is interested in.
> In the future we add one more bit and the same philosophy applies.
> Older kernels don't see this bit, but they don't have the feature
> to begin with. So where is the lack of extensibility?
>
> Jiri, there is a balance between extensibility and performance.
> It is senseless to use a TLV just so I can set a 0/1 (true/false).

You assume that the (user space) did sensible things.

Sometimes they do not, and set some bits to 1 when they should not.

If an old kernel just ignored these bits, the application just ran fine and
was _qualified_.

Now customers might use these _working_ applications.

Then Jamal comes and changes the kernel to give a meaning to these bits.

Now the customer is running the new kernel and the old application
breaks horribly.

Who is at fault? Jamal of course, not the application authors that
might be out of business, and could not have had any test that could have
spotted the (future) issue.

Please Jamal, can we stop this for good?
Jamal Hadi Salim April 21, 2017, 3:40 p.m. | #4
On 17-04-21 11:20 AM, Eric Dumazet wrote:
> On Fri, 2017-04-21 at 11:12 -0400, Jamal Hadi Salim wrote:
>> On 17-04-21 09:12 AM, Jiri Pirko wrote:
>>> Fri, Apr 21, 2017 at 12:55:31PM CEST, jhs@mojatatu.com wrote:
>>>> From: Jamal Hadi Salim <jhs@mojatatu.com>
>>

>> Jiri, there is a balance between extensibility and performance.
>> It is senseless to use a TLV just so I can set a 0/1 (true/false).
>
> You assume that the (user space) did sensible things.
>

If it didn't, it is a bug. Seriously.
Look: If this was the case a lot of things would break in the
kernel. We have bit flags everywhere. We add new ones frequently
to these bitmaps.

> Sometimes they do not, and set some bits to 1 when they should not.
>

That is a bug. You can't blame the compiler.

> If an old kernel just ignored these bits, the application just ran fine and
> was _qualified_.
>
> Now customers might use these _working_ applications.
>
> Then Jamal comes and changes the kernel to give a meaning to these bits.
>

As happens frequently, not just Jamal - Eric also;->

> Now the customer is running the new kernel and the old application
> breaks horribly.
>
> Who is at fault? Jamal of course, not the application authors that
> might be out of business, and could not have had any test that could have
> spotted the (future) issue.
>
> Please Jamal, can we stop this for good?

Eric: You are speaking in generalities and your starting premise is
wrong. Anyone setting random bits in a netlink bitmask that have not
been defined is creating a bug. We have many examples of how netlink
bitmasks are being used and constantly extended. Please take a look.

If I were the first person starting this today, then yes, you would be
making a lot of sense.
For the pads, the argument that malloc-ing the data structure may put
random values in the pads was a reasonable argument. But this is not.

cheers,
jamal
David Miller April 21, 2017, 4:07 p.m. | #5
From: Jamal Hadi Salim <jhs@mojatatu.com>
Date: Fri, 21 Apr 2017 11:40:00 -0400

> Eric: You are speaking in generalities and your starting premise is
> wrong.

I disagree.

If we never checked, it is our problem and our issue.  Not that of
the user.

If you want to start checking and verifying new attribute bitmasks
now, great!  But we are stuck in the case of existing bitmasks
and pads because we did not do so at the time we released them
into the wild.
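
To make the disagreement above concrete, here is a minimal sketch of the two
parsing styles under discussion. It is an illustration only, not code from
this patch: KNOWN_FLAGS, parse_flags_lenient() and parse_flags_strict() are
hypothetical names, and only TCA_FLAG_LARGE_DUMP_ON comes from the diff.

```c
#include <stdint.h>
#include <errno.h>

#define TCA_FLAG_LARGE_DUMP_ON	(1U << 0)		/* from this patch */
#define KNOWN_FLAGS		TCA_FLAG_LARGE_DUMP_ON	/* hypothetical mask */

/* Lenient parsing: unknown bits are silently ignored. An application that
 * sends garbage in bit 1 works today and changes behavior on the day a
 * later kernel gives bit 1 a meaning.
 */
static int parse_flags_lenient(uint32_t flags, int *large_dump)
{
	*large_dump = !!(flags & TCA_FLAG_LARGE_DUMP_ON);
	return 0;
}

/* Strict parsing: any bit outside the known mask is rejected from the first
 * release, so no working application can come to depend on sending it.
 */
static int parse_flags_strict(uint32_t flags, int *large_dump)
{
	if (flags & ~KNOWN_FLAGS)
		return -EINVAL;
	*large_dump = !!(flags & TCA_FLAG_LARGE_DUMP_ON);
	return 0;
}
```

With the strict variant, a request carrying 0x3 fails immediately instead of
being accepted, which is what allows bit 1 to be assigned later without
breaking an application that was already qualified against an older kernel.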
Eric Dumazet April 21, 2017, 4:10 p.m. | #6
On Fri, 2017-04-21 at 11:40 -0400, Jamal Hadi Salim wrote:
> On 17-04-21 11:20 AM, Eric Dumazet wrote:
> > On Fri, 2017-04-21 at 11:12 -0400, Jamal Hadi Salim wrote:
> >> On 17-04-21 09:12 AM, Jiri Pirko wrote:
> >>> Fri, Apr 21, 2017 at 12:55:31PM CEST, jhs@mojatatu.com wrote:
> >>>> From: Jamal Hadi Salim <jhs@mojatatu.com>
> >>
> 
> >> Jiri, there is a balance between extensibility and performance.
> >> It is senseless to use a TLV just so I can set a 0/1 (true/false).
> >
> > You assume that the (user space) did sensible things.
> >
> 
> If it didn't, it is a bug. Seriously.


If the application runs fine on linux-4.0, it is our _duty_ [1] to allow
it to run on linux-4.12.

Particularly if we have very easy ways to do so.

Like using new attributes instead of trying to 'reuse' some padding.

Pretending the old compiler or old applications were buggy is absolutely
irrelevant. That is not the point.

[1] Ask Linus Torvalds about that if you really need to get a final
arbitration.

Patch

diff --git a/include/uapi/linux/rtnetlink.h b/include/uapi/linux/rtnetlink.h
index cce0613..09e7b22d 100644
--- a/include/uapi/linux/rtnetlink.h
+++ b/include/uapi/linux/rtnetlink.h
@@ -674,10 +674,27 @@  struct tcamsg {
 	unsigned char	tca__pad1;
 	unsigned short	tca__pad2;
 };
+
+enum {
+	TCA_ROOT_UNSPEC,
+	TCA_ROOT_TAB,
+#define TCA_ACT_TAB TCA_ROOT_TAB
+	TCA_ROOT_FLAGS,
+	TCA_ROOT_COUNT,
+	__TCA_ROOT_MAX,
+#define	TCA_ROOT_MAX (__TCA_ROOT_MAX - 1)
+};
+
 #define TA_RTA(r)  ((struct rtattr*)(((char*)(r)) + NLMSG_ALIGN(sizeof(struct tcamsg))))
 #define TA_PAYLOAD(n) NLMSG_PAYLOAD(n,sizeof(struct tcamsg))
-#define TCA_ACT_TAB 1 /* attr type must be >=1 */	
-#define TCAA_MAX 1
+/* tcamsg flags stored in attribute TCA_ROOT_FLAGS
+ *
+ * TCA_FLAG_LARGE_DUMP_ON user->kernel to request for larger than TCA_ACT_MAX_PRIO
+ * actions in a dump. All dump responses will contain the number of actions
+ * being dumped stored in for user app's consumption in TCA_ROOT_COUNT
+ *
+ */
+#define TCA_FLAG_LARGE_DUMP_ON		(1 << 0)
 
 /* New extended info filters for IFLA_EXT_MASK */
 #define RTEXT_FILTER_VF		(1 << 0)
diff --git a/net/sched/act_api.c b/net/sched/act_api.c
index 9ce22b7..cfb3548 100644
--- a/net/sched/act_api.c
+++ b/net/sched/act_api.c
@@ -83,6 +83,7 @@  static int tcf_dump_walker(struct tcf_hashinfo *hinfo, struct sk_buff *skb,
 			   struct netlink_callback *cb)
 {
 	int err = 0, index = -1, i = 0, s_i = 0, n_i = 0;
+	u32 act_flags = cb->args[2];
 	struct nlattr *nest;
 
 	spin_lock_bh(&hinfo->lock);
@@ -111,14 +112,18 @@  static int tcf_dump_walker(struct tcf_hashinfo *hinfo, struct sk_buff *skb,
 			}
 			nla_nest_end(skb, nest);
 			n_i++;
-			if (n_i >= TCA_ACT_MAX_PRIO)
+			if (!(act_flags & TCA_FLAG_LARGE_DUMP_ON) &&
+			    n_i >= TCA_ACT_MAX_PRIO)
 				goto done;
 		}
 	}
 done:
 	spin_unlock_bh(&hinfo->lock);
-	if (n_i)
+	if (n_i) {
 		cb->args[0] += n_i;
+		if (act_flags & TCA_FLAG_LARGE_DUMP_ON)
+			cb->args[1] = n_i;
+	}
 	return n_i;
 
 nla_put_failure:
@@ -993,11 +998,15 @@  static int tcf_action_add(struct net *net, struct nlattr *nla,
 	return tcf_add_notify(net, n, &actions, portid);
 }
 
+static const struct nla_policy tcaa_policy[TCA_ROOT_MAX + 1] = {
+	[TCA_ROOT_FLAGS]      = { .type = NLA_U32 },
+};
+
 static int tc_ctl_action(struct sk_buff *skb, struct nlmsghdr *n,
 			 struct netlink_ext_ack *extack)
 {
 	struct net *net = sock_net(skb->sk);
-	struct nlattr *tca[TCAA_MAX + 1];
+	struct nlattr *tca[TCA_ROOT_MAX + 1];
 	u32 portid = skb ? NETLINK_CB(skb).portid : 0;
 	int ret = 0, ovr = 0;
 
@@ -1005,7 +1014,7 @@  static int tc_ctl_action(struct sk_buff *skb, struct nlmsghdr *n,
 	    !netlink_capable(skb, CAP_NET_ADMIN))
 		return -EPERM;
 
-	ret = nlmsg_parse(n, sizeof(struct tcamsg), tca, TCAA_MAX, NULL,
+	ret = nlmsg_parse(n, sizeof(struct tcamsg), tca, TCA_ROOT_MAX, NULL,
 			  extack);
 	if (ret < 0)
 		return ret;
@@ -1046,16 +1055,12 @@  static int tc_ctl_action(struct sk_buff *skb, struct nlmsghdr *n,
 	return ret;
 }
 
-static struct nlattr *find_dump_kind(const struct nlmsghdr *n)
+static struct nlattr *find_dump_kind(struct nlattr **nla)
 {
 	struct nlattr *tb1, *tb2[TCA_ACT_MAX + 1];
 	struct nlattr *tb[TCA_ACT_MAX_PRIO + 1];
-	struct nlattr *nla[TCAA_MAX + 1];
 	struct nlattr *kind;
 
-	if (nlmsg_parse(n, sizeof(struct tcamsg), nla, TCAA_MAX,
-			NULL, NULL) < 0)
-		return NULL;
 	tb1 = nla[TCA_ACT_TAB];
 	if (tb1 == NULL)
 		return NULL;
@@ -1082,8 +1087,18 @@  static int tc_dump_action(struct sk_buff *skb, struct netlink_callback *cb)
 	struct tc_action_ops *a_o;
 	int ret = 0;
 	struct tcamsg *t = (struct tcamsg *) nlmsg_data(cb->nlh);
-	struct nlattr *kind = find_dump_kind(cb->nlh);
+	struct nlattr *count_attr = NULL;
+	struct nlattr *tb[TCA_ROOT_MAX + 1];
+	struct nlattr *kind = NULL;
+	u32 act_flags = 0;
+	u32 act_count = 0;
+
+	ret = nlmsg_parse(cb->nlh, sizeof(struct tcamsg), tb, TCA_ROOT_MAX,
+			  tcaa_policy, NULL);
+	if (ret < 0)
+		return ret;
 
+	kind = find_dump_kind(tb);
 	if (kind == NULL) {
 		pr_info("tc_dump_action: action bad kind\n");
 		return 0;
@@ -1093,14 +1108,22 @@  static int tc_dump_action(struct sk_buff *skb, struct netlink_callback *cb)
 	if (a_o == NULL)
 		return 0;
 
+	if (tb[TCA_ROOT_FLAGS])
+		act_flags = nla_get_u32(tb[TCA_ROOT_FLAGS]);
+
 	nlh = nlmsg_put(skb, NETLINK_CB(cb->skb).portid, cb->nlh->nlmsg_seq,
 			cb->nlh->nlmsg_type, sizeof(*t), 0);
 	if (!nlh)
 		goto out_module_put;
+
+	cb->args[2] = act_flags;
 	t = nlmsg_data(nlh);
 	t->tca_family = AF_UNSPEC;
 	t->tca__pad1 = 0;
 	t->tca__pad2 = 0;
+	count_attr = nla_reserve(skb, TCA_ROOT_COUNT, sizeof(u32));
+	if (!count_attr)
+		goto out_module_put;
 
 	nest = nla_nest_start(skb, TCA_ACT_TAB);
 	if (nest == NULL)
@@ -1113,6 +1136,9 @@  static int tc_dump_action(struct sk_buff *skb, struct netlink_callback *cb)
 	if (ret > 0) {
 		nla_nest_end(skb, nest);
 		ret = skb->len;
+		act_count = cb->args[1];
+		memcpy(nla_data(count_attr), &act_count, sizeof(u32));
+		cb->args[1] = 0;
 	} else
 		nlmsg_trim(skb, b);