
[v2,net-next] net: sched: run ingress qdisc without locks

Message ID 1430544448-19777-1-git-send-email-ast@plumgrid.com
State Accepted, archived
Delegated to: David Miller
Headers show

Commit Message

Alexei Starovoitov May 2, 2015, 5:27 a.m. UTC
From: John Fastabend <john.r.fastabend@intel.com>

TC classifiers/actions were converted to RCU by John in the series:
http://thread.gmane.org/gmane.linux.network/329739/focus=329739
and many follow-on patches.
This is the last patch from that series; it finally drops the
ingress spin_lock.

Single cpu ingress+u32 performance goes from 22.9 Mpps to 24.5 Mpps.

In the two cpu case, when both cores are receiving traffic on the same
device and go into the same ingress+u32, the performance jumps
from 4.5 + 4.5 Mpps to 23.5 + 23.5 Mpps.

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
---

v1->v2: add From:John tag, Sob, Ack

 net/core/dev.c          |    2 --
 net/sched/sch_ingress.c |    5 +++--
 2 files changed, 3 insertions(+), 4 deletions(-)

Comments

Jesper Dangaard Brouer May 3, 2015, 3:42 p.m. UTC | #1
On Fri,  1 May 2015 22:27:28 -0700
Alexei Starovoitov <ast@plumgrid.com> wrote:

> From: John Fastabend <john.r.fastabend@intel.com>
> 
> TC classifiers/actions were converted to RCU by John in the series:
> http://thread.gmane.org/gmane.linux.network/329739/focus=329739
> and many follow-on patches.
> This is the last patch from that series; it finally drops the
> ingress spin_lock.

I absolutely love this change.  It is a huge step for ingress
scalability.


> Single cpu ingress+u32 performance goes from 22.9 Mpps to 24.5 Mpps.

I was actually expecting to see a higher performance boost.

 (processing cost per packet)
 (1/(22.9*10^6)*10^9) = 43.67 ns
 (1/(24.5*10^6)*10^9) = 40.82 ns
 improvement diff     = -2.85 ns

The patch is removing two atomic operations, spin_{un,}lock, which I
have benchmarked[1] to cost approx 14ns on my system.  Your system
likely is faster, but not that much (p.s. benchmark your own system
with [1])

[1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/time_bench_sample.c
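
The core of that measurement is just a tight loop around
spin_lock/spin_unlock timed with the TSC; a simplified sketch as a
kernel module (not the actual time_bench_sample.c code) looks roughly
like:

 /* Simplified sketch, not the actual time_bench_sample.c code:
  * time a tight spin_lock/spin_unlock loop and report cycles per op.
  */
 #include <linux/module.h>
 #include <linux/spinlock.h>
 #include <linux/timex.h>	/* get_cycles() */

 static DEFINE_SPINLOCK(bench_lock);

 static int __init bench_init(void)
 {
 	unsigned long i, loops = 10 * 1000 * 1000;
 	cycles_t start, stop;

 	start = get_cycles();
 	for (i = 0; i < loops; i++) {
 		spin_lock(&bench_lock);
 		spin_unlock(&bench_lock);
 	}
 	stop = get_cycles();

 	pr_info("spin_lock_unlock: %llu cycles(tsc) per elem\n",
 		(unsigned long long)((stop - start) / loops));
 	return 0;
 }

 static void __exit bench_exit(void)
 {
 }

 module_init(bench_init);
 module_exit(bench_exit);
 MODULE_LICENSE("GPL");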

> In the two cpu case, when both cores are receiving traffic on the same
> device and go into the same ingress+u32, the performance jumps
> from 4.5 + 4.5 Mpps to 23.5 + 23.5 Mpps.

This looks good for scalability :-)))

> Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
> Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
> Acked-by: Daniel Borkmann <daniel@iogearbox.net>

Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Alexei Starovoitov May 4, 2015, 5:12 a.m. UTC | #2
On 5/3/15 8:42 AM, Jesper Dangaard Brouer wrote:
>
> I was actually expecting to see a higher performance boost.
 > improvement diff     = -2.85 ns
...
> The patch is removing two atomic operations, spin_{un,}lock, which I
> have benchmarked[1] to cost approx 14ns on my system.  Your system
> likely is faster, but not that much (p.s. benchmark your own system
> with [1])
>
> [1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/time_bench_sample.c

I have tried your tight-loop spin_lock test on my box and it showed:
time_bench: Type:spin_lock_unlock Per elem: 40 cycles(tsc) 11.070 ns
and yet the total single cpu gain from removing spin_lock/unlock
in the ingress path is smaller than 11ns. I think this observation is
telling us that tight-loop benchmarking is inherently flawed.
I'm guessing that the uops cmpxchg is broken into can execute in
parallel with the uops of other insns, so a tight loop of the same
sequence of uops has more alu dependencies, whereas in a more normal
insn flow these uops can mix and match better. It would be great if
Intel microarch experts could chime in.

Jesper Dangaard Brouer May 4, 2015, 11:04 a.m. UTC | #3
On Sun, 03 May 2015 22:12:43 -0700
Alexei Starovoitov <ast@plumgrid.com> wrote:

> On 5/3/15 8:42 AM, Jesper Dangaard Brouer wrote:
> >
> > I was actually expecting to see a higher performance boost.
>  > improvement diff     = -2.85 ns
> ...
> > The patch is removing two atomic operations, spin_{un,}lock, which I
> > have benchmarked[1] to cost approx 14ns on my system.  Your system
> > likely is faster, but not that much (p.s. benchmark your own system
> > with [1])
> >
> > [1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/time_bench_sample.c
> 
> I have tried your tight-loop spin_lock test on my box and it showed:
> time_bench: Type:spin_lock_unlock Per elem: 40 cycles(tsc) 11.070 ns
> and yet the total single cpu gain from removing spin_lock/unlock
> in the ingress path is smaller than 11ns. I think this observation is
> telling us that tight-loop benchmarking is inherently flawed.
> I'm guessing that the uops cmpxchg is broken into can execute in
> parallel with the uops of other insns, so a tight loop of the same
> sequence of uops has more alu dependencies, whereas in a more normal
> insn flow these uops can mix and match better. It would be great if
> Intel microarch experts could chime in.

How do you activate the ingress code path?

I'm just doing (is this enough?):
 export DEV=eth4
 tc qdisc add dev $DEV handle ffff: ingress
 

I re-ran the experiment, and I can also only show a 2.68ns
improvement.  This is rather strange, and I cannot explain it.

The lock clearly shows up in perf report[1] with 12.23% raw_spin_lock,
and in perf report[2] it is clearly gone, yet we don't see a 12%
improvement in performance, only around 4.7%.

Before activating qdisc ingress code : 25.3Mpps (25398057)
Activating qdisc ingress with lock   : 16.9Mpps (16989315)
Activating qdisc ingress without lock: 17.8Mpps (17800496)

(1/17800496*10^9)-(1/16989315*10^9) = -2.68 ns

The "cost" of activating the ingress qdisc is also interesting:
 (1/25398057*10^9)-(1/16989315*10^9) = -19.49 ns
 (1/25398057*10^9)-(1/17800496*10^9) = -16.81 ns
Alexei Starovoitov May 5, 2015, 1:27 a.m. UTC | #4
On 5/4/15 4:04 AM, Jesper Dangaard Brouer wrote:
>
> How do you activate the ingress code path?
>
> I'm just doing (is this enough?):
>   export DEV=eth4
>   tc qdisc add dev $DEV handle ffff: ingress

yes, plus my numbers also include a u32 classifier.
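
For reference, a minimal u32 rule can be attached to that ingress
qdisc along these lines (a placeholder match-all filter, not
necessarily the exact rule behind the numbers above):

 tc filter add dev $DEV parent ffff: protocol ip prio 1 \
    u32 match u32 0 0 flowid 1:1 action drop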

> I re-ran the experiment, and I can also only show a 2.68ns
> improvement.  This is rather strange, and I cannot explain it.
>
> The lock clearly shows up in perf report[1] with 12.23% raw_spin_lock,
> and in perf report[2] it is clearly gone, yet we don't see a 12%
> improvement in performance, only around 4.7%.

It's indeed puzzling. Hopefully Intel experts can chime in.

> The "cost" of activating the ingress qdisc is also interesting:
>   (1/25398057*10^9)-(1/16989315*10^9) = -19.49 ns
>   (1/25398057*10^9)-(1/17800496*10^9) = -16.81 ns

yep, we're working hard on reducing it.
btw the cost of enabling rps without using it is ~8ns.
Our line rate goal is still a bit far, but hopefully getting closer :)

Patch

diff --git a/net/core/dev.c b/net/core/dev.c
index 97a15ae8d07a..862875ec8f2f 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3538,10 +3538,8 @@  static int ing_filter(struct sk_buff *skb, struct netdev_queue *rxq)
 
 	q = rcu_dereference(rxq->qdisc);
 	if (q != &noop_qdisc) {
-		spin_lock(qdisc_lock(q));
 		if (likely(!test_bit(__QDISC_STATE_DEACTIVATED, &q->state)))
 			result = qdisc_enqueue_root(skb, q);
-		spin_unlock(qdisc_lock(q));
 	}
 
 	return result;
diff --git a/net/sched/sch_ingress.c b/net/sched/sch_ingress.c
index 4cdbfb85686a..a89cc3278bfb 100644
--- a/net/sched/sch_ingress.c
+++ b/net/sched/sch_ingress.c
@@ -65,11 +65,11 @@  static int ingress_enqueue(struct sk_buff *skb, struct Qdisc *sch)
 
 	result = tc_classify(skb, fl, &res);
 
-	qdisc_bstats_update(sch, skb);
+	qdisc_bstats_update_cpu(sch, skb);
 	switch (result) {
 	case TC_ACT_SHOT:
 		result = TC_ACT_SHOT;
-		qdisc_qstats_drop(sch);
+		qdisc_qstats_drop_cpu(sch);
 		break;
 	case TC_ACT_STOLEN:
 	case TC_ACT_QUEUED:
@@ -91,6 +91,7 @@  static int ingress_enqueue(struct sk_buff *skb, struct Qdisc *sch)
 static int ingress_init(struct Qdisc *sch, struct nlattr *opt)
 {
 	net_inc_ingress_queue();
+	sch->flags |= TCQ_F_CPUSTATS;
 
 	return 0;
 }