| Message ID | 1506190048.29839.206.camel@edumazet-glaptop3.roam.corp.google.com |
|---|---|
| State | Accepted, archived |
| Delegated to | David Miller |
| Series | [net-next] sch_netem: faster rb tree removal |
On 9/23/17 12:07 PM, Eric Dumazet wrote:
> From: Eric Dumazet <edumazet@google.com>
>
> While running TCP tests involving netem storing millions of packets,
> I had the idea to speed up tfifo_reset() and did experiments.
>
> I tried the rbtree_postorder_for_each_entry_safe() method that is
> used in skb_rbtree_purge() but discovered it was slower than the
> current tfifo_reset() method.
>
> I measured the time taken to release skbs at three occupation levels
> (10^4, 10^5 and 10^6 skbs) with three methods:
>
> 1) (current 'naive' method)
>
>	while ((p = rb_first(&q->t_root))) {
>		struct sk_buff *skb = netem_rb_to_skb(p);
>
>		rb_erase(p, &q->t_root);
>		rtnl_kfree_skbs(skb, skb);
>	}
>
> 2) Use rb_next() instead of rb_first() in the loop:
>
>	p = rb_first(&q->t_root);
>	while (p) {
>		struct sk_buff *skb = netem_rb_to_skb(p);
>
>		p = rb_next(p);
>		rb_erase(&skb->rbnode, &q->t_root);
>		rtnl_kfree_skbs(skb, skb);
>	}
>
> 3) "optimized" method using rbtree_postorder_for_each_entry_safe()
>
>	struct sk_buff *skb, *next;
>
>	rbtree_postorder_for_each_entry_safe(skb, next,
>					     &q->t_root, rbnode) {
>		rtnl_kfree_skbs(skb, skb);
>	}
>	q->t_root = RB_ROOT;
>
> Results:
>
> method_1: while (rb_first()) rb_erase()                  10000 skbs in    690378 ns  (69 ns per skb)
> method_2: rb_first; while (p) { p = rb_next(p); ... }    10000 skbs in    541846 ns  (54 ns per skb)
> method_3: rbtree_postorder_for_each_entry_safe()         10000 skbs in    868307 ns  (86 ns per skb)
>
> method_1: while (rb_first()) rb_erase()                  99996 skbs in   7804021 ns  (78 ns per skb)
> method_2: rb_first; while (p) { p = rb_next(p); ... }   100000 skbs in   5942456 ns  (59 ns per skb)
> method_3: rbtree_postorder_for_each_entry_safe()        100000 skbs in  11584940 ns (115 ns per skb)
>
> method_1: while (rb_first()) rb_erase()                1000000 skbs in 108577838 ns (108 ns per skb)
> method_2: rb_first; while (p) { p = rb_next(p); ... }  1000000 skbs in  82619635 ns  (82 ns per skb)
> method_3: rbtree_postorder_for_each_entry_safe()       1000000 skbs in 127328743 ns (127 ns per skb)
>
> Method 2) is simply faster, probably because it maintains a smaller
> working set size.
>
> Note that this is the method we already use in tcp_ofo_queue().
>
> I will also change skb_rbtree_purge() in a second patch.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> ---
>  net/sched/sch_netem.c | 7 ++++---
>  1 file changed, 4 insertions(+), 3 deletions(-)
>
> diff --git a/net/sched/sch_netem.c b/net/sched/sch_netem.c
> index 063a4bdb9ee6f26b01387959e8f6ccd15ec16191..5a4f1008029068372019a965186e7a3c0a18aac3 100644
> --- a/net/sched/sch_netem.c
> +++ b/net/sched/sch_netem.c
> @@ -361,12 +361,13 @@ static psched_time_t packet_len_2_sched_time(unsigned int len, struct netem_sche
>  static void tfifo_reset(struct Qdisc *sch)
>  {
>  	struct netem_sched_data *q = qdisc_priv(sch);
> -	struct rb_node *p;
> +	struct rb_node *p = rb_first(&q->t_root);
>
> -	while ((p = rb_first(&q->t_root))) {
> +	while (p) {
>  		struct sk_buff *skb = netem_rb_to_skb(p);
>
> -		rb_erase(p, &q->t_root);
> +		p = rb_next(p);
> +		rb_erase(&skb->rbnode, &q->t_root);
>  		rtnl_kfree_skbs(skb, skb);
>  	}
>  }

Hi Eric:

I'm guessing the cost is in the rb_first and rb_next computations. Did
you consider something like this:

	struct rb_root *root;
	struct rb_node **p = &root->rb_node;

	while (*p != NULL) {
		struct foobar *fb;

		fb = container_of(*p, struct foobar, rb_node);
		// fb processing
		p = &root->rb_node;
	}
On 9/24/17 7:57 PM, David Ahern wrote:
> On 9/23/17 12:07 PM, Eric Dumazet wrote:
>> [...]
>
> Hi Eric:
>
> I'm guessing the cost is in the rb_first and rb_next computations. Did
> you consider something like this:
>
>	struct rb_root *root;
>	struct rb_node **p = &root->rb_node;
>
>	while (*p != NULL) {
>		struct foobar *fb;
>
>		fb = container_of(*p, struct foobar, rb_node);
>		// fb processing
		rb_erase(&fb->rb_node, root);
>		p = &root->rb_node;
>	}

Oops, I dropped the rb_erase when consolidating the code into this snippet.
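Putting David's correction in place, the loop under discussion can be
consolidated as in the sketch below. struct foobar is a placeholder
type, and kfree() stands in for the real per-node processing (e.g.
rtnl_kfree_skbs() in netem); re-reading root->rb_node each iteration
replaces the explicit p = &root->rb_node; reset.

	/* Sketch of the "always erase the current root" flush. */
	struct foobar {
		struct rb_node rb_node;
		/* payload ... */
	};

	static void flush_from_root(struct rb_root *root)
	{
		while (root->rb_node) {
			struct foobar *fb = rb_entry(root->rb_node,
						     struct foobar, rb_node);

			rb_erase(&fb->rb_node, root); /* removes the root node */
			kfree(fb);
		}
	}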
On Sun, 2017-09-24 at 20:05 -0600, David Ahern wrote:
> On 9/24/17 7:57 PM, David Ahern wrote:
>> Hi Eric:
>>
>> I'm guessing the cost is in the rb_first and rb_next computations. Did
>> you consider something like this:
>>
>> [...]
>>
>>		// fb processing
>	rb_erase(&fb->rb_node, root);
>>		p = &root->rb_node;
>>	}
>
> Oops, I dropped the rb_erase when consolidating the code into this snippet.

Hi David

This gives about the same numbers as method_1.

I tried with 10^7 skbs in the tree:

Your suggestion takes 66 ns per skb, while the one I chose takes 37 ns
per skb.

Thanks.
On 9/24/17 11:27 PM, Eric Dumazet wrote:
> On Sun, 2017-09-24 at 20:05 -0600, David Ahern wrote:
>> [...]
>
> Hi David
>
> This gives about the same numbers as method_1.
>
> I tried with 10^7 skbs in the tree:
>
> Your suggestion takes 66 ns per skb, while the one I chose takes 37 ns
> per skb.

Thanks for the test.

I made a simple program this morning and ran it under perf. With the
above suggestion, rb_erase has a high cost because it always deletes
the root node. Your method 1 has a high cost in rb_first, which is
expected given its definition, and it runs on each removal. Both
options increase in time with the number of entries in the tree.

Your method 2 is fairly constant from 10,000 entries to 10M entries,
which makes sense: there is a one-time cost to find rb_first, and then
a bottom node is always removed, so rb_erase is light.

As for the change:

Acked-by: David Ahern <dsahern@gmail.com>
On Mon, 2017-09-25 at 10:14 -0600, David Ahern wrote:
> Thanks for the test.
>
> [...]
>
> As for the change:
>
> Acked-by: David Ahern <dsahern@gmail.com>

Thanks a lot for double-checking!
From: David Ahern <dsahern@gmail.com>
Date: Mon, 25 Sep 2017 10:14:23 -0600

> I made a simple program this morning and ran it under perf.

If possible please submit this for selftests.

Thank you.
On 9/25/17 2:11 PM, David Miller wrote:
> From: David Ahern <dsahern@gmail.com>
> Date: Mon, 25 Sep 2017 10:14:23 -0600
>
>> I made a simple program this morning and ran it under perf.
>
> If possible please submit this for selftests.

It is more of a microbenchmark of options for flushing an rbtree than a
self-test. Further, it relies on tools/lib/rbtree.c rather than
lib/rbtree.c. The tools/lib version was imported by Arnaldo in July 2015
and is a bit out of date, though it is good enough to show the intent
w.r.t. the flushing options.
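For reference, a minimal userspace sketch of such a microbenchmark is
shown below: insert n nodes with random keys, then time the "method 2"
flush. All names here are illustrative (this is not David's actual
program), and it assumes tools/lib/rbtree.c and tools/include/linux/rbtree.h
are compiled in, e.g. gcc -I tools/include bench.c tools/lib/rbtree.c.

	#include <stdio.h>
	#include <stdlib.h>
	#include <time.h>
	#include <linux/rbtree.h>

	struct item {
		struct rb_node rb_node;
		unsigned long key;
	};

	static void insert_item(struct rb_root *root, struct item *it)
	{
		struct rb_node **p = &root->rb_node, *parent = NULL;

		while (*p) {
			struct item *cur = rb_entry(*p, struct item, rb_node);

			parent = *p;
			p = it->key < cur->key ? &(*p)->rb_left : &(*p)->rb_right;
		}
		rb_link_node(&it->rb_node, parent, p);
		rb_insert_color(&it->rb_node, root);
	}

	/* "method 2": one rb_first(), then rb_next() before each erase */
	static void flush_method2(struct rb_root *root)
	{
		struct rb_node *p = rb_first(root);

		while (p) {
			struct item *it = rb_entry(p, struct item, rb_node);

			p = rb_next(p);
			rb_erase(&it->rb_node, root);
			free(it);
		}
	}

	int main(int argc, char **argv)
	{
		unsigned long i, n = argc > 1 ? strtoul(argv[1], NULL, 0) : 100000;
		struct rb_root root = RB_ROOT;
		struct timespec t0, t1;

		for (i = 0; i < n; i++) {
			struct item *it = malloc(sizeof(*it));

			it->key = random();
			insert_item(&root, it);
		}

		clock_gettime(CLOCK_MONOTONIC, &t0);
		flush_method2(&root);
		clock_gettime(CLOCK_MONOTONIC, &t1);

		printf("%lu nodes flushed in %ld ns\n", n,
		       (t1.tv_sec - t0.tv_sec) * 1000000000L +
		       (t1.tv_nsec - t0.tv_nsec));
		return 0;
	}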
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Sat, 23 Sep 2017 11:07:28 -0700

> From: Eric Dumazet <edumazet@google.com>
>
> While running TCP tests involving netem storing millions of packets,
> I had the idea to speed up tfifo_reset() and did experiments.
>
> I tried the rbtree_postorder_for_each_entry_safe() method that is
> used in skb_rbtree_purge() but discovered it was slower than the
> current tfifo_reset() method.
>
> I measured the time taken to release skbs at three occupation levels
> (10^4, 10^5 and 10^6 skbs) with three methods:
...
> Results:
...
> I will also change skb_rbtree_purge() in a second patch.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Applied, thanks Eric.
diff --git a/net/sched/sch_netem.c b/net/sched/sch_netem.c
index 063a4bdb9ee6f26b01387959e8f6ccd15ec16191..5a4f1008029068372019a965186e7a3c0a18aac3 100644
--- a/net/sched/sch_netem.c
+++ b/net/sched/sch_netem.c
@@ -361,12 +361,13 @@ static psched_time_t packet_len_2_sched_time(unsigned int len, struct netem_sche
 static void tfifo_reset(struct Qdisc *sch)
 {
 	struct netem_sched_data *q = qdisc_priv(sch);
-	struct rb_node *p;
+	struct rb_node *p = rb_first(&q->t_root);
 
-	while ((p = rb_first(&q->t_root))) {
+	while (p) {
 		struct sk_buff *skb = netem_rb_to_skb(p);
 
-		rb_erase(p, &q->t_root);
+		p = rb_next(p);
+		rb_erase(&skb->rbnode, &q->t_root);
 		rtnl_kfree_skbs(skb, skb);
 	}
 }
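The commit message mentions a second patch applying the same pattern to
skb_rbtree_purge(). A sketch of what that conversion would look like,
following the same rb_first()/rb_next() idiom (shown for illustration,
not necessarily the follow-up patch as merged):

	void skb_rbtree_purge(struct rb_root *root)
	{
		struct rb_node *p = rb_first(root);

		while (p) {
			struct sk_buff *skb = rb_entry(p, struct sk_buff, rbnode);

			/* advance before erasing, so rb_next() stays valid */
			p = rb_next(p);
			rb_erase(&skb->rbnode, root);
			kfree_skb(skb);
		}
	}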