Message ID | 1435842455-30501-7-git-send-email-edumazet@google.com
---|---
State | Superseded, archived
Delegated to | David Miller
On 15-07-02 06:07 AM, Eric Dumazet wrote:
> Final step for gact RCU operation :
>
> 1) Use percpu stats
> 2) update lastuse only every clock tick
> 3) Remove spinlock acquisition, as it is no longer needed.
>
> Since this is the last contended lock in packet RX when tc gact is used,
> this gives impressive gain.
>
> My host with 8 RX queues was handling 5 Mpps before the patch,
> and more than 10 Mpps after patch.
>
> Tested:
>
> On receiver :
> IP=ip
> TC=tc
> dev=eth0
>
> $TC qdisc del dev $dev ingress 2>/dev/null
> $TC qdisc add dev $dev ingress
> $TC filter del dev $dev root pref 10 2>/dev/null
> $TC filter del dev $dev pref 10 2>/dev/null
> tc filter add dev $dev est 1sec 4sec parent ffff: protocol ip prio 1 \
>	u32 match ip src 7.0.0.0/8 flowid 1:15 action drop
>
> Sender sends packets flood from 7/8 network
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: Alexei Starovoitov <ast@plumgrid.com>
> Cc: Jamal Hadi Salim <jhs@mojatatu.com>
> Cc: John Fastabend <john.fastabend@gmail.com>
> ---
>  net/sched/act_gact.c | 14 ++++++--------
>  1 file changed, 6 insertions(+), 8 deletions(-)

[...]

> @@ -121,9 +121,8 @@ static int tcf_gact(struct sk_buff *skb, const struct tc_action *a,
>  		    struct tcf_result *res)
>  {
>  	struct tcf_gact *gact = a->priv;
> -	int action = gact->tcf_action;
> +	int action = READ_ONCE(gact->tcf_action);
>
> -	spin_lock(&gact->tcf_lock);
>  #ifdef CONFIG_GACT_PROB
>  	{
>  		u32 ptype = READ_ONCE(gact->tcfg_ptype);
> @@ -132,12 +131,11 @@ static int tcf_gact(struct sk_buff *skb, const struct tc_action *a,
>  			action = gact_rand[ptype](gact);
>  	}
>  #endif
> -	gact->tcf_bstats.bytes += qdisc_pkt_len(skb);
> -	gact->tcf_bstats.packets++;
> +	bstats_cpu_update(this_cpu_ptr(gact->common.cpu_bstats), skb);
>  	if (action == TC_ACT_SHOT)
> -		gact->tcf_qstats.drops++;
> -	gact->tcf_tm.lastuse = jiffies;
> -	spin_unlock(&gact->tcf_lock);
> +		qstats_drop_inc(this_cpu_ptr(gact->common.cpu_qstats));
> +	if (gact->tcf_tm.lastuse != jiffies)
> +		gact->tcf_tm.lastuse = jiffies;

I'm missing the point of the if block. Is that really good enough
for the 32bit system case? I would have expected some wrapper to
handle it here, something like u64_stats_() maybe _u64_jiffies(). Maybe
after a coffee I'll make sense of it.

>
>  	return action;
>  }

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On 7/2/15 6:07 AM, Eric Dumazet wrote:
> Final step for gact RCU operation :
>
> 1) Use percpu stats
> 2) update lastuse only every clock tick
> 3) Remove spinlock acquisition, as it is no longer needed.
>
> Since this is the last contended lock in packet RX when tc gact is used,
> this gives impressive gain.
>
> My host with 8 RX queues was handling 5 Mpps before the patch,
> and more than 10 Mpps after patch.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Great stuff. Thank you for fixing it!

Acked-by: Alexei Starovoitov <ast@plumgrid.com>
On Thu, 2015-07-02 at 09:35 -0700, John Fastabend wrote:
> > +	if (gact->tcf_tm.lastuse != jiffies)
> > +		gact->tcf_tm.lastuse = jiffies;
>
> I'm missing the point of the if block. Is that really good enough
> for the 32bit system case? I would have expected some wrapper to
> handle it here, something like u64_stats_() maybe _u64_jiffies(). Maybe
> after a coffee I'll make sense of it.

Point is to not dirty cache line for every packet ?

Doing the test means we attempt dirtying only ~HZ times per second,
which really matters to handle millions of packets per second.

My tests show a good enough performance, not sure we want a percpu thing
for this lastuse field.
On 07/02/15 09:07, Eric Dumazet wrote:
> Final step for gact RCU operation :
>
> 1) Use percpu stats
> 2) update lastuse only every clock tick
> 3) Remove spinlock acquisition, as it is no longer needed.
>
> Since this is the last contended lock in packet RX when tc gact is used,
> this gives impressive gain.
>
> My host with 8 RX queues was handling 5 Mpps before the patch,
> and more than 10 Mpps after patch.
>
> Tested:
>
> On receiver :
> IP=ip
> TC=tc
> dev=eth0
>
> $TC qdisc del dev $dev ingress 2>/dev/null
> $TC qdisc add dev $dev ingress
> $TC filter del dev $dev root pref 10 2>/dev/null
> $TC filter del dev $dev pref 10 2>/dev/null
> tc filter add dev $dev est 1sec 4sec parent ffff: protocol ip prio 1 \
>	u32 match ip src 7.0.0.0/8 flowid 1:15 action drop
>
> Sender sends packets flood from 7/8 network
>

Very nice Eric ;-> thanks.
So now basic accept/drop should be flying ;->
Other really low hanging fruit is act_csum, vlan and skbedit.
CCing the respective authors.

> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: Alexei Starovoitov <ast@plumgrid.com>
> Cc: Jamal Hadi Salim <jhs@mojatatu.com>

Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>

cheers,
jamal
On 07/02/15 16:59, Eric Dumazet wrote:
> On Thu, 2015-07-02 at 09:35 -0700, John Fastabend wrote:
>
> Point is to not dirty cache line for every packet ?
>
> Doing the test means we attempt dirtying only ~HZ times per second,
> which really matters to handle millions of packets per second.
>
> My tests show a good enough performance, not sure we want a percpu thing
> for this lastuse field.
>

Does it harm to always set gact->tcf_tm.lastuse ?

cheers,
jamal
On Fri, 2015-07-03 at 07:25 -0400, Jamal Hadi Salim wrote:
> On 07/02/15 16:59, Eric Dumazet wrote:
> > On Thu, 2015-07-02 at 09:35 -0700, John Fastabend wrote:
> >
> > Point is to not dirty cache line for every packet ?
> >
> > Doing the test means we attempt dirtying only ~HZ times per second,
> > which really matters to handle millions of packets per second.
> >
> > My tests show a good enough performance, not sure we want a percpu thing
> > for this lastuse field.
>
> Does it harm to always set gact->tcf_tm.lastuse ?

Yes it harms, because of one false sharing on a cache line,
for every packet...

Samples: 20K of event 'cycles:pp', Event count (approx.): 15451660731
  32.50%  ksoftirqd/1  [kernel.kallsyms]  [k] tcf_gact
  19.43%  ksoftirqd/1  [kernel.kallsyms]  [k] put_compound_page
   4.34%  ksoftirqd/1  [kernel.kallsyms]  [k] __skb_get_hash
   3.88%  ksoftirqd/1  [kernel.kallsyms]  [k] memcpy_erms
   2.95%  ksoftirqd/1  [kernel.kallsyms]  [k] get_rps_cpu
   2.81%  ksoftirqd/1  [kernel.kallsyms]  [k] __build_skb

Note that gact->tcf_action & gact->tcf_tm.lastuse share same cache line

      │    Disassembly of section load0:
      │
      │    ffffffffa016f040 <load0>:
 0.11 │      nop
 0.06 │      push   %rbp
 0.14 │      mov    %rsp,%rbp
 0.06 │      push   %r12
      │      mov    %rdi,%r12
      │      push   %rbx
 0.16 │      mov    (%rsi),%rbx
98.46 │      mov    0x20(%rbx),%eax       action = READ_ONCE(gact->tcf_action);
      │                                   _huge_ stall here because another cpu
      │                                   dirtied gact->tcf_tm.lastuse
 0.33 │      movzwl 0x98(%rbx),%edx
      │      test   %edx,%edx
      │      jne    87
      │ 20:  mov    0x88(%rbx),%rdx
 0.11 │      add    %gs:0x5fe9b0b9(%rip),%rdx
      │      mov    0x28(%r12),%ecx
 0.02 │      mov    0x8(%rdx),%edi
      │      mov    $0x1,%esi
 0.11 │      add    %rcx,(%rdx)
      │      mov    0xc8(%r12),%ecx
      │      add    0xd0(%r12),%rcx
 0.08 │      cmpw   $0x0,0x2(%rcx)
      │      je     5a
      │      movzwl 0x4(%rcx),%esi
      │ 5a:  add    %edi,%esi
      │      cmp    $0x2,%eax
 0.14 │      mov    %esi,0x8(%rdx)
      │      jne    77
      │      mov    0x90(%rbx),%rdx
      │      add    %gs:0x5fe9b075(%rip),%rdx
 0.11 │      addl   $0x1,0x8(%rdx)
      │ 77:  mov    -0x1e5640be(%rip),%rdx
      │      mov    %rdx,0x30(%rbx)        gact->tcf_tm.lastuse = jiffies;
 0.08 │      pop    %rbx
 0.02 │      pop    %r12
      │      pop    %rbp
      │      retq
      │ 87:  mov    %edx,%edx
      │      mov    %rbx,%rdi
      │      callq  *-0x5fe90b70(,%rdx,8)
      │      jmp    20
On 15-07-03 05:08 AM, Eric Dumazet wrote:
> On Fri, 2015-07-03 at 07:25 -0400, Jamal Hadi Salim wrote:
>> On 07/02/15 16:59, Eric Dumazet wrote:
>>> On Thu, 2015-07-02 at 09:35 -0700, John Fastabend wrote:
>>
>>>
>>> Point is to not dirty cache line for every packet ?
>>>
>>> Doing the test means we attempt dirtying only ~HZ times per second,
>>> which really matters to handle millions of packets per second.
>>>
>>> My tests show a good enough performance, not sure we want a percpu thing
>>> for this lastuse field.

Agreed, percpu is overkill; thanks for the explanation.

Looks good to me.
On Fri, Jul 3, 2015 at 1:07 PM, Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> Very nice Eric ;-> thanks.
> So now basic accept/drop should be flying ;->
> Other really low hanging fruit is act_csum, vlan and skbedit.
> CCing the respective authors.

Note that act_mirred can be converted as well. I will do this because
this is my main usage.
diff --git a/net/sched/act_gact.c b/net/sched/act_gact.c
index 7c7e72e95943..e054e5630aab 100644
--- a/net/sched/act_gact.c
+++ b/net/sched/act_gact.c
@@ -88,7 +88,7 @@ static int tcf_gact_init(struct net *net, struct nlattr *nla,
 	if (!tcf_hash_check(parm->index, a, bind)) {
 		ret = tcf_hash_create(parm->index, est, a, sizeof(*gact),
-				      bind, false);
+				      bind, true);
 		if (ret)
 			return ret;
 		ret = ACT_P_CREATED;
@@ -121,9 +121,8 @@ static int tcf_gact(struct sk_buff *skb, const struct tc_action *a,
 		    struct tcf_result *res)
 {
 	struct tcf_gact *gact = a->priv;
-	int action = gact->tcf_action;
+	int action = READ_ONCE(gact->tcf_action);
 
-	spin_lock(&gact->tcf_lock);
 #ifdef CONFIG_GACT_PROB
 	{
 		u32 ptype = READ_ONCE(gact->tcfg_ptype);
@@ -132,12 +131,11 @@ static int tcf_gact(struct sk_buff *skb, const struct tc_action *a,
 			action = gact_rand[ptype](gact);
 	}
 #endif
-	gact->tcf_bstats.bytes += qdisc_pkt_len(skb);
-	gact->tcf_bstats.packets++;
+	bstats_cpu_update(this_cpu_ptr(gact->common.cpu_bstats), skb);
 	if (action == TC_ACT_SHOT)
-		gact->tcf_qstats.drops++;
-	gact->tcf_tm.lastuse = jiffies;
-	spin_unlock(&gact->tcf_lock);
+		qstats_drop_inc(this_cpu_ptr(gact->common.cpu_qstats));
+	if (gact->tcf_tm.lastuse != jiffies)
+		gact->tcf_tm.lastuse = jiffies;
 
 	return action;
 }
Final step for gact RCU operation :

1) Use percpu stats
2) update lastuse only every clock tick
3) Remove spinlock acquisition, as it is no longer needed.

Since this is the last contended lock in packet RX when tc gact is used,
this gives impressive gain.

My host with 8 RX queues was handling 5 Mpps before the patch,
and more than 10 Mpps after patch.

Tested:

On receiver :
IP=ip
TC=tc
dev=eth0

$TC qdisc del dev $dev ingress 2>/dev/null
$TC qdisc add dev $dev ingress
$TC filter del dev $dev root pref 10 2>/dev/null
$TC filter del dev $dev pref 10 2>/dev/null
tc filter add dev $dev est 1sec 4sec parent ffff: protocol ip prio 1 \
	u32 match ip src 7.0.0.0/8 flowid 1:15 action drop

Sender sends packets flood from 7/8 network

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Alexei Starovoitov <ast@plumgrid.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: John Fastabend <john.fastabend@gmail.com>
---
 net/sched/act_gact.c | 14 ++++++--------
 1 file changed, 6 insertions(+), 8 deletions(-)