
[net-next,6/6] net_sched: act: remove spinlock in fast path

Message ID 1435842455-30501-7-git-send-email-edumazet@google.com
State Superseded, archived
Delegated to: David Miller

Commit Message

Eric Dumazet July 2, 2015, 1:07 p.m. UTC
Final step for gact RCU operation:

1) Use percpu stats
2) update lastuse only every clock tick
3) Remove spinlock acquisition, as it is no longer needed.

Since this is the last contended lock in packet RX when tc gact is used,
this gives an impressive gain.

My host with 8 RX queues was handling 5 Mpps before the patch,
and more than 10 Mpps after the patch.

Tested:

On receiver :
IP=ip
TC=tc
dev=eth0

$TC qdisc del dev $dev ingress 2>/dev/null
$TC qdisc add dev $dev ingress
$TC filter del dev $dev root pref 10 2>/dev/null
$TC filter del dev $dev pref 10 2>/dev/null
$TC filter add dev $dev est 1sec 4sec parent ffff: protocol ip prio 1 \
	u32 match ip src 7.0.0.0/8 flowid 1:15 action drop

The sender floods packets from the 7.0.0.0/8 network.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Alexei Starovoitov <ast@plumgrid.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: John Fastabend <john.fastabend@gmail.com>
---
 net/sched/act_gact.c | 14 ++++++--------
 1 file changed, 6 insertions(+), 8 deletions(-)
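
For orientation before the review comments, here is a condensed sketch of the
lockless fast path this patch arrives at. The helper and field names are taken
from the diff at the bottom of this page, but the snippet is a paraphrase, not
a verbatim copy of net/sched/act_gact.c (the CONFIG_GACT_PROB branch is
omitted):

	/* Condensed sketch of tcf_gact() after the patch (paraphrased). */
	static int tcf_gact(struct sk_buff *skb, const struct tc_action *a,
			    struct tcf_result *res)
	{
		struct tcf_gact *gact = a->priv;
		/* tcf_action can be rewritten by a concurrent configuration
		 * change; read it once, without taking tcf_lock.
		 */
		int action = READ_ONCE(gact->tcf_action);

		/* 1) per-cpu byte/packet and drop counters: no shared counter
		 *    cache line is written for every packet.
		 */
		bstats_cpu_update(this_cpu_ptr(gact->common.cpu_bstats), skb);
		if (action == TC_ACT_SHOT)
			qstats_drop_inc(this_cpu_ptr(gact->common.cpu_qstats));

		/* 2) lastuse only needs jiffies resolution, so write it at
		 *    most ~HZ times per second instead of once per packet.
		 */
		if (gact->tcf_tm.lastuse != jiffies)
			gact->tcf_tm.lastuse = jiffies;

		/* 3) no spin_lock()/spin_unlock() pair left in the per-packet
		 *    path.
		 */
		return action;
	}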

Comments

John Fastabend July 2, 2015, 4:35 p.m. UTC | #1
On 15-07-02 06:07 AM, Eric Dumazet wrote:
> Final step for gact RCU operation :
> 
> 1) Use percpu stats
> 2) update lastuse only every clock tick
> 3) Remove spinlock acquisition, as it is no longer needed.
> 
> Since this is the last contended lock in packet RX when tc gact is used,
> this gives impressive gain.
> 
> My host with 8 RX queues was handling 5 Mpps before the patch,
> and more than 10 Mpps after patch.
> 
> Tested:
> 
> On receiver :
> IP=ip
> TC=tc
> dev=eth0
> 
> $TC qdisc del dev $dev ingress 2>/dev/null
> $TC qdisc add dev $dev ingress
> $TC filter del dev $dev root pref 10 2>/dev/null
> $TC filter del dev $dev pref 10 2>/dev/null
> tc filter add dev $dev est 1sec 4sec parent ffff: protocol ip prio 1 \
> 	u32 match ip src 7.0.0.0/8 flowid 1:15 action drop
> 
> Sender sends packets flood from 7/8 network
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: Alexei Starovoitov <ast@plumgrid.com>
> Cc: Jamal Hadi Salim <jhs@mojatatu.com>
> Cc: John Fastabend <john.fastabend@gmail.com>
> ---
>  net/sched/act_gact.c | 14 ++++++--------
>  1 file changed, 6 insertions(+), 8 deletions(-)

[...]

> @@ -121,9 +121,8 @@ static int tcf_gact(struct sk_buff *skb, const struct tc_action *a,
>  		    struct tcf_result *res)
>  {
>  	struct tcf_gact *gact = a->priv;
> -	int action = gact->tcf_action;
> +	int action = READ_ONCE(gact->tcf_action);
>  
> -	spin_lock(&gact->tcf_lock);
>  #ifdef CONFIG_GACT_PROB
>  	{
>  	u32 ptype = READ_ONCE(gact->tcfg_ptype);
> @@ -132,12 +131,11 @@ static int tcf_gact(struct sk_buff *skb, const struct tc_action *a,
>  		action = gact_rand[ptype](gact);
>  	}
>  #endif
> -	gact->tcf_bstats.bytes += qdisc_pkt_len(skb);
> -	gact->tcf_bstats.packets++;
> +	bstats_cpu_update(this_cpu_ptr(gact->common.cpu_bstats), skb);
>  	if (action == TC_ACT_SHOT)
> -		gact->tcf_qstats.drops++;
> -	gact->tcf_tm.lastuse = jiffies;
> -	spin_unlock(&gact->tcf_lock);
> +		qstats_drop_inc(this_cpu_ptr(gact->common.cpu_qstats));
> +	if (gact->tcf_tm.lastuse != jiffies)
> +		gact->tcf_tm.lastuse = jiffies;

I'm missing the point of the if block. Is that really good enough
for the 32-bit system case? I would have expected some wrapper to
handle it here, something like u64_stats_() or maybe _u64_jiffies().
Maybe after a coffee I'll make sense of it.

>  
>  	return action;
>  }
> 

Alexei Starovoitov July 2, 2015, 5:13 p.m. UTC | #2
On 7/2/15 6:07 AM, Eric Dumazet wrote:
> Final step for gact RCU operation :
>
> 1) Use percpu stats
> 2) update lastuse only every clock tick
> 3) Remove spinlock acquisition, as it is no longer needed.
>
> Since this is the last contended lock in packet RX when tc gact is used,
> this gives impressive gain.
>
> My host with 8 RX queues was handling 5 Mpps before the patch,
> and more than 10 Mpps after patch.
>
> Signed-off-by: Eric Dumazet<edumazet@google.com>

Great stuff. Thank you for fixing it!

Acked-by: Alexei Starovoitov <ast@plumgrid.com>

Eric Dumazet July 2, 2015, 8:59 p.m. UTC | #3
On Thu, 2015-07-02 at 09:35 -0700, John Fastabend wrote:
> > +	if (gact->tcf_tm.lastuse != jiffies)
> > +		gact->tcf_tm.lastuse = jiffies;
> 
> I'm missing the point of the if block. Is that really good enough
> for the 32bit system case? I would have expected some wrapper to
> handle it here something like u64_stats_() maybe _u64_jiffies(). Maybe
> after a coffee I'll make sense of it.
> 

The point is to not dirty the cache line for every packet.

Doing the test means we attempt the dirtying only ~HZ times per second,
which really matters when handling millions of packets per second.

My tests show good enough performance; I'm not sure we want a percpu thing
for this lastuse field.
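
To illustrate the pattern, here is a minimal sketch of the guarded write;
touch_lastuse() is a hypothetical helper used only for illustration, not
something in the tree. jiffies only advances ~HZ times per second, so in the
common case the comparison sees an equal value and no store -- and therefore
no cache-line invalidation visible to other CPUs -- happens:

	/* Hypothetical helper showing the guarded-write pattern applied to
	 * gact->tcf_tm.lastuse in this patch.
	 */
	static inline void touch_lastuse(unsigned long *lastuse)
	{
		if (*lastuse != jiffies)	/* read-only in the common case */
			*lastuse = jiffies;	/* dirties the line ~HZ times a
						 * second, not once per packet */
	}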



Jamal Hadi Salim July 3, 2015, 11:07 a.m. UTC | #4
On 07/02/15 09:07, Eric Dumazet wrote:
> Final step for gact RCU operation :
>
> 1) Use percpu stats
> 2) update lastuse only every clock tick
> 3) Remove spinlock acquisition, as it is no longer needed.
>
> Since this is the last contended lock in packet RX when tc gact is used,
> this gives impressive gain.
>
> My host with 8 RX queues was handling 5 Mpps before the patch,
> and more than 10 Mpps after patch.
>
> Tested:
>
> On receiver :
> IP=ip
> TC=tc
> dev=eth0
>
> $TC qdisc del dev $dev ingress 2>/dev/null
> $TC qdisc add dev $dev ingress
> $TC filter del dev $dev root pref 10 2>/dev/null
> $TC filter del dev $dev pref 10 2>/dev/null
> tc filter add dev $dev est 1sec 4sec parent ffff: protocol ip prio 1 \
> 	u32 match ip src 7.0.0.0/8 flowid 1:15 action drop
>
> Sender sends packets flood from 7/8 network
>

Very nice Eric;-> thanks.
So now basic accept/drop should be flying ;->
Other really low-hanging fruit is act_csum, vlan and skbedit.
CCing the respective authors.

> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: Alexei Starovoitov <ast@plumgrid.com>
> Cc: Jamal Hadi Salim <jhs@mojatatu.com>

Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>

cheers,
jamal
Jamal Hadi Salim July 3, 2015, 11:25 a.m. UTC | #5
On 07/02/15 16:59, Eric Dumazet wrote:
> On Thu, 2015-07-02 at 09:35 -0700, John Fastabend wrote:

>
> Point is to not dirty cache line for every packet ?
>
> Doing the test means we attempt dirtying only ~HZ times per second,
> which really matters to handle millions of packets per second.
>
> My tests show a good enough performance, not sure we want a percpu thing
> for this lastuse field.
>

Does it harm to always set gact->tcf_tm.lastuse?

cheers,
jamal
Eric Dumazet July 3, 2015, 12:08 p.m. UTC | #6
On Fri, 2015-07-03 at 07:25 -0400, Jamal Hadi Salim wrote:
> On 07/02/15 16:59, Eric Dumazet wrote:
> > On Thu, 2015-07-02 at 09:35 -0700, John Fastabend wrote:
> 
> >
> > Point is to not dirty cache line for every packet ?
> >
> > Doing the test means we attempt dirtying only ~HZ times per second,
> > which really matters to handle millions of packets per second.
> >
> > My tests show a good enough performance, not sure we want a percpu thing
> > for this lastuse field.
> >
> 
> Does it harm to always set gact->tcf_tm.lastuse ?


Yes, it does harm, because of false sharing on a cache line for every
packet...

Samples: 20K of event 'cycles:pp', Event count (approx.): 15451660731                                                                                       
 32.50%  ksoftirqd/1  [kernel.kallsyms]  [k] tcf_gact
 19.43%  ksoftirqd/1  [kernel.kallsyms]  [k] put_compound_page
  4.34%  ksoftirqd/1  [kernel.kallsyms]  [k] __skb_get_hash
  3.88%  ksoftirqd/1  [kernel.kallsyms]  [k] memcpy_erms
  2.95%  ksoftirqd/1  [kernel.kallsyms]  [k] get_rps_cpu
  2.81%  ksoftirqd/1  [kernel.kallsyms]  [k] __build_skb


Note that gact->tcf_action and gact->tcf_tm.lastuse share the same cache line.


      │    Disassembly of section load0:
       │
       │    ffffffffa016f040 <load0>:
  0.11 │      nop
  0.06 │      push   %rbp
  0.14 │      mov    %rsp,%rbp
  0.06 │      push   %r12
       │      mov    %rdi,%r12
       │      push   %rbx
  0.16 │      mov    (%rsi),%rbx
 98.46 │      mov    0x20(%rbx),%eax         action = READ_ONCE(gact->tcf_action); 

_huge_ stall here because another cpu dirtied gact->tcf_tm.lastuse

  0.33 │      movzwl 0x98(%rbx),%edx
       │      test   %edx,%edx
       │      jne    87
       │20:   mov    0x88(%rbx),%rdx
  0.11 │      add    %gs:0x5fe9b0b9(%rip),%rdx
       │      mov    0x28(%r12),%ecx
  0.02 │      mov    0x8(%rdx),%edi
       │      mov    $0x1,%esi
  0.11 │      add    %rcx,(%rdx)
       │      mov    0xc8(%r12),%ecx
       │      add    0xd0(%r12),%rcx
  0.08 │      cmpw   $0x0,0x2(%rcx)
       │      je     5a
       │      movzwl 0x4(%rcx),%esi
       │5a:   add    %edi,%esi
       │      cmp    $0x2,%eax
  0.14 │      mov    %esi,0x8(%rdx)
       │      jne    77
       │      mov    0x90(%rbx),%rdx
       │      add    %gs:0x5fe9b075(%rip),%rdx
  0.11 │      addl   $0x1,0x8(%rdx)
       │77:   mov    -0x1e5640be(%rip),%rdx
       │      mov    %rdx,0x30(%rbx)                 gact->tcf_tm.lastuse = jiffies;
  0.08 │      pop    %rbx
  0.02 │      pop    %r12
       │      pop    %rbp
       │      retq
       │87:   mov    %edx,%edx
       │      mov    %rbx,%rdi
       │      callq  *-0x5fe90b70(,%rdx,8)
       │      jmp    20
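
To make the layout point concrete, here is a small sketch with explicit
padding. The 0x20 and 0x30 offsets are taken from the annotated disassembly
above; the struct itself is simplified and hypothetical, not the real
struct tcf_common:

	/* Hypothetical, simplified layout: both fields land in the first
	 * 64-byte cache line, so an unconditional per-packet store to
	 * lastuse on one CPU invalidates the line every other CPU must
	 * read to load tcf_action.
	 */
	struct gact_like {
		char		pad0[0x20];	/* index, refcnt, lock, ... */
		int		tcf_action;	/* 0x20: read for every packet */
		char		pad1[0x30 - 0x20 - sizeof(int)];
		unsigned long	lastuse;	/* 0x30: written for every packet
						 * before this patch */
	};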


John Fastabend July 3, 2015, 5:21 p.m. UTC | #7
On 15-07-03 05:08 AM, Eric Dumazet wrote:
> On Fri, 2015-07-03 at 07:25 -0400, Jamal Hadi Salim wrote:
>> On 07/02/15 16:59, Eric Dumazet wrote:
>>> On Thu, 2015-07-02 at 09:35 -0700, John Fastabend wrote:
>>
>>>
>>> Point is to not dirty cache line for every packet ?
>>>
>>> Doing the test means we attempt dirtying only ~HZ times per second,
>>> which really matters to handle millions of packets per second.
>>>
>>> My tests show a good enough performance, not sure we want a percpu thing
>>> for this lastuse field.

Agreed, percpu is overkill; thanks for the explanation. Looks good
to me.

Eric Dumazet July 4, 2015, 5:45 a.m. UTC | #8
On Fri, Jul 3, 2015 at 1:07 PM, Jamal Hadi Salim <jhs@mojatatu.com> wrote:

> Very nice Eric;-> thanks.
> So now basic accept/drop should be flying ;->
> Other really low hanging fruit is act_csum, vlan and skbedit.
> CCing the respective authors.

Note that act_mirred can be converted as well. I will do this, since it
is my main use case.

Patch

diff --git a/net/sched/act_gact.c b/net/sched/act_gact.c
index 7c7e72e95943..e054e5630aab 100644
--- a/net/sched/act_gact.c
+++ b/net/sched/act_gact.c
@@ -88,7 +88,7 @@ static int tcf_gact_init(struct net *net, struct nlattr *nla,
 
 	if (!tcf_hash_check(parm->index, a, bind)) {
 		ret = tcf_hash_create(parm->index, est, a, sizeof(*gact),
-				      bind, false);
+				      bind, true);
 		if (ret)
 			return ret;
 		ret = ACT_P_CREATED;
@@ -121,9 +121,8 @@ static int tcf_gact(struct sk_buff *skb, const struct tc_action *a,
 		    struct tcf_result *res)
 {
 	struct tcf_gact *gact = a->priv;
-	int action = gact->tcf_action;
+	int action = READ_ONCE(gact->tcf_action);
 
-	spin_lock(&gact->tcf_lock);
 #ifdef CONFIG_GACT_PROB
 	{
 	u32 ptype = READ_ONCE(gact->tcfg_ptype);
@@ -132,12 +131,11 @@ static int tcf_gact(struct sk_buff *skb, const struct tc_action *a,
 		action = gact_rand[ptype](gact);
 	}
 #endif
-	gact->tcf_bstats.bytes += qdisc_pkt_len(skb);
-	gact->tcf_bstats.packets++;
+	bstats_cpu_update(this_cpu_ptr(gact->common.cpu_bstats), skb);
 	if (action == TC_ACT_SHOT)
-		gact->tcf_qstats.drops++;
-	gact->tcf_tm.lastuse = jiffies;
-	spin_unlock(&gact->tcf_lock);
+		qstats_drop_inc(this_cpu_ptr(gact->common.cpu_qstats));
+	if (gact->tcf_tm.lastuse != jiffies)
+		gact->tcf_tm.lastuse = jiffies;
 
 	return action;
 }