
[net-next] net: optimize csum_replace2()

Message ID 1395629496.9117.47.camel@edumazet-glaptop2.roam.corp.google.com
State Accepted, archived
Delegated to: David Miller

Commit Message

Eric Dumazet March 24, 2014, 2:51 a.m. UTC
From: Eric Dumazet <edumazet@google.com>

When changing one 16bit value by another in IP header, we can adjust the
IP checksum by doing a simple operation described in RFC 1624,
as reminded by David.

csum_partial() is a complex function on x86_64, not really suited
for small number of checksummed bytes.

I spotted csum_partial() being in the top 20 most consuming
functions (more than 1 %) in a GRO workload, which was rather
unexpected.

The caller was inet_gro_complete() doing a csum_replace2() when
building the new IP header for the GRO packet.

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 include/net/checksum.h |   23 +++++++++++++++++++++--
 1 file changed, 21 insertions(+), 2 deletions(-)
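
To see the RFC 1624 identity HC' = ~(~HC + ~m + m') in action, here is a minimal
stand-alone sketch in plain user-space C. It re-types the helpers with uint16_t
instead of the kernel's __sum16/__be16 and ignores byte order, so it only
illustrates the arithmetic behind the patch, not the patched header itself:

/*
 * Stand-alone sketch of the RFC 1624 update HC' = ~(~HC + ~m + m').
 * Plain user-space C: uint16_t/uint32_t instead of __sum16/__be16,
 * byte order ignored.  Not the kernel code, just the arithmetic.
 */
#include <stdint.h>
#include <stdio.h>

static uint16_t csum16_add(uint16_t csum, uint16_t addend)
{
	uint16_t res = csum + addend;

	return res + (res < addend);	/* end-around carry */
}

static uint16_t csum16_sub(uint16_t csum, uint16_t addend)
{
	return csum16_add(csum, ~addend);
}

/* Full Internet checksum over n 16-bit words, for comparison. */
static uint16_t csum_full(const uint16_t *w, int n)
{
	uint32_t sum = 0;

	while (n--)
		sum += *w++;
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return ~sum;
}

int main(void)
{
	uint16_t hdr[10] = { 0x4500, 0x0054, 0x1c46, 0x4000, 0x4001, 0,
			     0xc0a8, 0x0001, 0xc0a8, 0x00c7 };
	uint16_t old_len = hdr[1], new_len = 0x05dc;
	uint16_t hc, hc_inc;

	hdr[5] = csum_full(hdr, 10);	/* valid header checksum */
	hc = hdr[5];

	/* Incremental update for hdr[1] changing from old_len to new_len. */
	hc_inc = ~csum16_add(csum16_sub(~hc, old_len), new_len);

	hdr[1] = new_len;
	hdr[5] = 0;
	printf("recomputed %#x, incremental %#x\n", csum_full(hdr, 10), hc_inc);
	return 0;
}

Both printed values should match: the incremental update produces the same
checksum as a full recomputation over the modified header.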




Comments

David Miller March 24, 2014, 4:20 a.m. UTC | #1
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Sun, 23 Mar 2014 19:51:36 -0700

> From: Eric Dumazet <edumazet@google.com>
> 
> When changing one 16bit value by another in IP header, we can adjust the
> IP checksum by doing a simple operation described in RFC 1624,
> as reminded by David.
> 
> csum_partial() is a complex function on x86_64, not really suited
> for small number of checksummed bytes.
> 
> I spotted csum_partial() being in the top 20 most consuming
> functions (more than 1 %) in a GRO workload, which was rather
> unexpected.
> 
> The caller was inet_gro_complete() doing a csum_replace2() when
> building the new IP header for the GRO packet.
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Applied, thanks for following through on this, Eric.

Would be nice to improve csum_replace4() similarly, since every
NAT packet uses that thing when we change the address in the
protocol header.
Eric Dumazet March 24, 2014, 1:25 p.m. UTC | #2
On Mon, 2014-03-24 at 10:22 +0000, David Laight wrote:
> From: Eric Dumazet <edumazet@google.com>
> > 
> > When changing one 16bit value by another in IP header, we can adjust the
> > IP checksum by doing a simple operation described in RFC 1624,
> > as reminded by David.
> > 
> > csum_partial() is a complex function on x86_64, not really suited
> > for small number of checksummed bytes.
> > 
> > I spotted csum_partial() being in the top 20 most consuming
> > functions (more than 1 %) in a GRO workload, which was rather
> > unexpected.
> > 
> > The caller was inet_gro_complete() doing a csum_replace2() when
> > building the new IP header for the GRO packet.
> > 
> > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > ---
> >  include/net/checksum.h |   23 +++++++++++++++++++++--
> >  1 file changed, 21 insertions(+), 2 deletions(-)
> > 
> > diff --git a/include/net/checksum.h b/include/net/checksum.h
> > index 37a0e24adbe7..a28f4e0f6251 100644
> > --- a/include/net/checksum.h
> > +++ b/include/net/checksum.h
> > @@ -69,6 +69,19 @@ static inline __wsum csum_sub(__wsum csum, __wsum addend)
> >  	return csum_add(csum, ~addend);
> >  }
> > 
> > +static inline __sum16 csum16_add(__sum16 csum, __be16 addend)
> > +{
> > +	u16 res = (__force u16)csum;
> 
> Shouldn't that be u32 ?

Why ? We compute 16bit checksums here.

> 
> > +	res += (__force u16)addend;
> > +	return (__force __sum16)(res + (res < (__force u16)addend));

Note how carry is propagated, and how we return 16 bit anyway.

Using u32 would force us to use a fold, which is more expensive.

> > +}
> > +
> > +static inline __sum16 csum16_sub(__sum16 csum, __be16 addend)
> > +{
> > +	return csum16_add(csum, ~addend);
> > +}
> > +
> >  static inline __wsum
> >  csum_block_add(__wsum csum, __wsum csum2, int offset)
> >  {
> > @@ -112,9 +125,15 @@ static inline void csum_replace4(__sum16 *sum, __be32 from, __be32 to)
> >  	*sum = csum_fold(csum_partial(diff, sizeof(diff), ~csum_unfold(*sum)));
> >  }
> > 
> > -static inline void csum_replace2(__sum16 *sum, __be16 from, __be16 to)
> > +/* Implements RFC 1624 (Incremental Internet Checksum)
> > + * 3. Discussion states :
> > + *     HC' = ~(~HC + ~m + m')
> > + *  m : old value of a 16bit field
> > + *  m' : new value of a 16bit field
> > + */
> > +static inline void csum_replace2(__sum16 *sum, __be16 old, __be16 new)
> >  {
> > -	csum_replace4(sum, (__force __be32)from, (__force __be32)to);
> > +	*sum = ~csum16_add(csum16_sub(~(*sum), old), new);
> >  }
> 
> It might be clearer to just say:
> 	*sum = ~csum16_add(csum16_add(~*sum, ~old), new);
> or even:
> 	*sum = ~csum16_add(csum16_add(*sum ^ 0xffff, old ^ 0xffff), new);
> which might remove some mask instructions - especially if all the
> intermediate values are left larger than 16 bits.

We have csum_add() and csum_sub(), I added csum16_add() and
csum16_sub(), for analogy and completeness.

For linux guys, it's quite common stuff to use csum_sub(x, y) instead of
csum_add(x, ~y).

You can use whatever code matches your taste.

RFC 1624 mentions ~m, not m ^ 0xffff

But again if you prefer m ^ 0xffff, that's up to you ;)
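
For reference, the trade-off discussed above - folding the carry back inside
16 bits versus accumulating in 32 bits and folding at the end - can be written
out in plain C types (a sketch only; these are not the kernel's __sum16/__wsum
helpers):

#include <stdint.h>

/* 16-bit one's-complement add: the carry is folded back immediately,
 * so the result can be stored straight into a 16-bit header field. */
static uint16_t csum16_add(uint16_t csum, uint16_t addend)
{
	uint16_t res = csum + addend;

	return res + (res < addend);
}

/* 32-bit accumulator: the same add, but an explicit fold is needed
 * before the value fits back into 16 bits. */
static uint16_t csum32_add_fold(uint16_t csum, uint16_t addend)
{
	uint32_t res = (uint32_t)csum + addend;

	return (uint16_t)((res & 0xffff) + (res >> 16));
}

Both functions return the same value for every input; the difference is only
in how much work the compiler has to emit on a given target.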



David Laight March 24, 2014, 2:38 p.m. UTC | #3
From: Eric Dumazet

> On Mon, 2014-03-24 at 10:22 +0000, David Laight wrote:
> > From: Eric Dumazet <edumazet@google.com>
> > >
> > > When changing one 16bit value by another in IP header, we can adjust the
> > > IP checksum by doing a simple operation described in RFC 1624,
> > > as reminded by David.
> > >
> > > csum_partial() is a complex function on x86_64, not really suited
> > > for small number of checksummed bytes.
> > >
> > > I spotted csum_partial() being in the top 20 most consuming
> > > functions (more than 1 %) in a GRO workload, which was rather
> > > unexpected.
> > >
> > > The caller was inet_gro_complete() doing a csum_replace2() when
> > > building the new IP header for the GRO packet.
> > >
> > > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > > ---
> > >  include/net/checksum.h |   23 +++++++++++++++++++++--
> > >  1 file changed, 21 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/include/net/checksum.h b/include/net/checksum.h
> > > index 37a0e24adbe7..a28f4e0f6251 100644
> > > --- a/include/net/checksum.h
> > > +++ b/include/net/checksum.h
> > > @@ -69,6 +69,19 @@ static inline __wsum csum_sub(__wsum csum, __wsum addend)
> > >  	return csum_add(csum, ~addend);
> > >  }
> > >
> > > +static inline __sum16 csum16_add(__sum16 csum, __be16 addend)
> > > +{
> > > +	u16 res = (__force u16)csum;
> >
> > Shouldn't that be u32 ?
>
> Why ? We compute 16bit checksums here.
>
> >
> > > +	res += (__force u16)addend;
> > > +	return (__force __sum16)(res + (res < (__force u16)addend));
>
> Note how carry is propagated, and how we return 16 bit anyway.

Not enough coffee - my brain didn't parse that correctly at all :-(

> Using u32 would force us to use a fold, which is more expensive.

Try compiling for something other than x86/amd64
(ie something without 16bit arithmetic).
The ppc compiler I have (not the latest) generates 3 masks with 0xffff and
a conditional branch for the above function (without the inline).

> > > +}
> > > +
> > > +static inline __sum16 csum16_sub(__sum16 csum, __be16 addend)
> > > +{
> > > +	return csum16_add(csum, ~addend);
> > > +}
> > > +
> > >  static inline __wsum
> > >  csum_block_add(__wsum csum, __wsum csum2, int offset)
> > >  {
> > > @@ -112,9 +125,15 @@ static inline void csum_replace4(__sum16 *sum, __be32 from, __be32 to)
> > >  	*sum = csum_fold(csum_partial(diff, sizeof(diff), ~csum_unfold(*sum)));
> > >  }
> > >
> > > -static inline void csum_replace2(__sum16 *sum, __be16 from, __be16 to)
> > > +/* Implements RFC 1624 (Incremental Internet Checksum)
> > > + * 3. Discussion states :
> > > + *     HC' = ~(~HC + ~m + m')
> > > + *  m : old value of a 16bit field
> > > + *  m' : new value of a 16bit field
> > > + */
> > > +static inline void csum_replace2(__sum16 *sum, __be16 old, __be16 new)
> > >  {
> > > -	csum_replace4(sum, (__force __be32)from, (__force __be32)to);
> > > +	*sum = ~csum16_add(csum16_sub(~(*sum), old), new);
> > >  }
> >
> > It might be clearer to just say:
> > 	*sum = ~csum16_add(csum16_add(~*sum, ~old), new);
> > or even:
> > 	*sum = ~csum16_add(csum16_add(*sum ^ 0xffff, old ^ 0xffff), new);
> > which might remove some mask instructions - especially if all the
> > intermediate values are left larger than 16 bits.
>
> We have csum_add() and csum_sub(), I added csum16_add() and
> csum16_sub(), for analogy and completeness.
>
> For linux guys, it's quite common stuff to use csum_sub(x, y) instead of
> csum_add(x, ~y).
>
> You can use whatever code matches your taste.
>
> RFC 1624 mentions ~m, not m ^ 0xffff

But C only has 32bit maths, so all the 16bit values keep
needing masking.

> But again if you prefer m ^ 0xffff, that's up to you ;)

On ppc I get much better code for (retyped):
static inline u32 csum_add(u32 c, u32 a)
{
	c += a;
	return (c + (c >> 16)) & 0xffff;
}
void csum_repl(u16 *c16, u16 o16, u16 n)
{
	u32 c = *c16, o = o16;
	*c16 = csum_add(csum_add(c ^ 0xffff, o ^ 0xffff), n);
}
13 instructions and no branches against 21 instructions
and 2 branches.
The same is probably true of arm.
For amd64 (where the compiler can do a 16bit compare)
my version is one instruction longer.
I've not tried to guess at the actual cycle counts!

	David
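
For anyone who wants to reproduce the comparison, the two shapes can be put
side by side in one file and fed to a cross compiler (e.g.
powerpc-linux-gnu-gcc -O2 -S) to inspect the generated code. The sketch below
uses plain C types rather than the kernel header's, and the 32-bit variant
adds the final complement that the quick retype above appears to leave out:

#include <stdint.h>

static inline uint16_t csum16_add(uint16_t csum, uint16_t addend)
{
	uint16_t res = csum + addend;

	return res + (res < addend);
}

/* Shape of the merged patch: 16-bit intermediates throughout. */
void csum_replace2_16(uint16_t *sum, uint16_t old, uint16_t new16)
{
	*sum = ~csum16_add(csum16_add(~*sum, ~old), new16);
}

static inline uint32_t csum_add32(uint32_t c, uint32_t a)
{
	c += a;
	return (c + (c >> 16)) & 0xffff;
}

/* Shape suggested above: 32-bit intermediates, one fold per add,
 * complemented once at the end. */
void csum_replace2_32(uint16_t *sum, uint16_t old, uint16_t new16)
{
	uint32_t c = *sum;

	*sum = csum_add32(csum_add32(c ^ 0xffff, old ^ 0xffff), new16) ^ 0xffff;
}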
Eric Dumazet March 24, 2014, 3:14 p.m. UTC | #4
On Mon, 2014-03-24 at 14:38 +0000, David Laight wrote:
> But C only has 32bit maths, so all the 16bit values keep
> need masking.
> 
> > But again if you prefer m ^ 0xffff, thats up to you ;)
> 
> On ppc I get much better code for (retyped):
> static inline u32 csum_add(u32 c, u32 a)
> {
> 	c += a'
> 	return (c + (c >> 16)) & 0xffff;
> }
> void csum_repl(u16 *c16, u16 o16, u16 n)
> {
> 	u32 c = *c16, o = o16;
> 	*c16 = csum_add(csum_add(c ^ 0xffff, o ^ 0xffff), n);
> }
> 13 instructions and no branches against 21 instructions
> and 2 branches.
> The same is probably true of arm.
> For amd64 (where the compiler can do a 16bit compare)
> my version is one instruction longer.
> I've not tried to guess at the actual cycle counts!
> 
> 	David

Code I provided uses no conditional branch on x86.

It sounds like you could provide a helper for arm, if you really care about this
path. I find it surprising you did not comment on my prior mails on this
subject and you suddenly seem to care now that the patch is merged.

We for example have the following helper in x86 :

static inline unsigned add32_with_carry(unsigned a, unsigned b)
{
        asm("addl %2,%0\n\t"
            "adcl $0,%0"
            : "=r" (a)
            : "0" (a), "r" (b));
        return a;
}

But these days, gcc seems to do a pretty good job without these helpers.

Yes we could save 4 instructions, but I doubt it is worth the pain of
adding arch helpers. I certainly won't do it.

0000000000000030 <inet_gro_complete>:
      30:	55                   	push   %rbp
      31:	48 63 c6             	movslq %esi,%rax
      34:	48 03 87 d0 00 00 00 	add    0xd0(%rdi),%rax
      3b:	48 89 e5             	mov    %rsp,%rbp
      3e:	8b 57 68             	mov    0x68(%rdi),%edx
      41:	44 0f b7 40 02       	movzwl 0x2(%rax),%r8d
      46:	0f b7 48 0a          	movzwl 0xa(%rax),%ecx
      4a:	66 29 f2             	sub    %si,%dx
      4d:	66 c1 c2 08          	rol    $0x8,%dx
      51:	44 0f b6 48 09       	movzbl 0x9(%rax),%r9d
      56:	66 89 50 02          	mov    %dx,0x2(%rax)
      5a:	41 f7 d0             	not    %r8d
      5d:	f7 d1                	not    %ecx
      5f:	66 44 01 c1          	add    %r8w,%cx
      63:	41 0f 92 c0          	setb   %r8b
      67:	45 0f b6 c0          	movzbl %r8b,%r8d
      6b:	44 01 c1             	add    %r8d,%ecx
      6e:	66 01 d1             	add    %dx,%cx
      71:	41 0f 92 c0          	setb   %r8b
      75:	45 0f b6 c0          	movzbl %r8b,%r8d
      79:	44 01 c1             	add    %r8d,%ecx
      7c:	f7 d1                	not    %ecx
      7e:	66 89 48 0a          	mov    %cx,0xa(%rax)
      82:	49 63 c1             	movslq %r9d,%rax
      85:	48 8b 04 c5 00 00 00 	mov    0x0(,%rax,8),%rax


David Laight March 24, 2014, 3:52 p.m. UTC | #5
From: Eric Dumazet

...
> Code I provided uses no conditional branch on x86.

All the world isn't x86...
The sparc, ppc and arm people might want to consider optimising
this code further.

> It sounds like you could provide a helper for arm, if you really care about this
> path. I find it surprising you did not comment on my prior mails on this
> subject and you suddenly seem to care now that the patch is merged.

I don't remember seeing this particular patch before today.
Last week you were still sorting out the stall caused by
the 16bit write -> 32bit load in the existing code.

In any case your change is clearly a significant improvement on
what was there before.

> We for example have the following helper in x86 :
>
> static inline unsigned add32_with_carry(unsigned a, unsigned b)
> {
>         asm("addl %2,%0\n\t"
>             "adcl $0,%0"
>             : "=r" (a)
>             : "0" (a), "r" (b));
>         return a;
> }
>
> But these days, gcc seems to do a pretty good job without these helpers.

Indeed.

While x86 can do 16bit maths, most cpus can't - so the generated
code for 'short' (and 'char') maths must mask with 0xffff (or 0xff)
every time a value is written to a local (ie register) variable.
In general you get better code by using local variables that
are the same size as machine registers.
This also applies to function arguments and return types.

I'm not sure how much difference it would make overall.
It rather depends on whether anything appears in a very
hot path. OTOH a lot of small changes can add together.

	David
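
A small illustration of that point, using the end-around-carry add from the
patch written once with a 16-bit intermediate and once with a register-sized
one (plain C types, illustrative only); which form compiles better depends on
the target, as discussed above:

#include <stdint.h>

/* 16-bit local: the comparison needs the wrapped 16-bit value, so on
 * targets without native 16-bit arithmetic the compiler has to keep
 * truncating the intermediate result. */
uint16_t add_carry_narrow(uint16_t a, uint16_t b)
{
	uint16_t res = a + b;

	return res + (res < b);
}

/* Register-sized local: the carry is read from bit 16 instead, and the
 * only narrowing is the final return. */
uint16_t add_carry_wide(uint16_t a, uint16_t b)
{
	uint32_t res = (uint32_t)a + b;

	return (uint16_t)((res & 0xffff) + (res >> 16));
}

Both functions compute the same 16-bit one's-complement sum; only the shape
the compiler sees differs.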

Patch

diff --git a/include/net/checksum.h b/include/net/checksum.h
index 37a0e24adbe7..a28f4e0f6251 100644
--- a/include/net/checksum.h
+++ b/include/net/checksum.h
@@ -69,6 +69,19 @@  static inline __wsum csum_sub(__wsum csum, __wsum addend)
 	return csum_add(csum, ~addend);
 }
 
+static inline __sum16 csum16_add(__sum16 csum, __be16 addend)
+{
+	u16 res = (__force u16)csum;
+
+	res += (__force u16)addend;
+	return (__force __sum16)(res + (res < (__force u16)addend));
+}
+
+static inline __sum16 csum16_sub(__sum16 csum, __be16 addend)
+{
+	return csum16_add(csum, ~addend);
+}
+
 static inline __wsum
 csum_block_add(__wsum csum, __wsum csum2, int offset)
 {
@@ -112,9 +125,15 @@  static inline void csum_replace4(__sum16 *sum, __be32 from, __be32 to)
 	*sum = csum_fold(csum_partial(diff, sizeof(diff), ~csum_unfold(*sum)));
 }
 
-static inline void csum_replace2(__sum16 *sum, __be16 from, __be16 to)
+/* Implements RFC 1624 (Incremental Internet Checksum)
+ * 3. Discussion states :
+ *     HC' = ~(~HC + ~m + m')
+ *  m : old value of a 16bit field
+ *  m' : new value of a 16bit field
+ */
+static inline void csum_replace2(__sum16 *sum, __be16 old, __be16 new)
 {
-	csum_replace4(sum, (__force __be32)from, (__force __be32)to);
+	*sum = ~csum16_add(csum16_sub(~(*sum), old), new);
 }
 
 struct sk_buff;
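
As a usage note, the caller mentioned in the changelog follows this pattern
after the patch: rewrite the IP header's total length for the merged GRO
packet, then patch the header checksum incrementally. The sketch below shows
the call pattern only; it is not a copy of the kernel's inet_gro_complete():

#include <linux/ip.h>		/* struct iphdr */
#include <net/checksum.h>	/* csum_replace2() */

/* Illustrative helper, not kernel code: update tot_len and fix the
 * header checksum without recomputing it from scratch. */
static void example_gro_fixup_iph(struct iphdr *iph, unsigned int new_total_len)
{
	__be16 newlen = htons(new_total_len);

	csum_replace2(&iph->check, iph->tot_len, newlen);
	iph->tot_len = newlen;
}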