Message ID | 1d1362c8aa696e316d3ba97dce2342df6f6ee6cf.1432047904.git.christophe.leroy@c-s.fr (mailing list archive)
---|---
State | Accepted
Delegated to: | Scott Wood
From: Linuxppc-dev Christophe Leroy
> Sent: 19 May 2015 16:19
...
> diff --git a/arch/powerpc/include/asm/checksum.h b/arch/powerpc/include/asm/checksum.h
> index 5e43d2d..e8d9ef4 100644
> --- a/arch/powerpc/include/asm/checksum.h
> +++ b/arch/powerpc/include/asm/checksum.h
> @@ -130,6 +130,22 @@ static inline __sum16 csum_tcpudp_magic(__be32 saddr, __be32 daddr,
> 	return csum_fold(csum_tcpudp_nofold(saddr, daddr, len, proto, sum));
> }
>
> +#define HAVE_ARCH_CSUM_ADD
> +static inline __wsum csum_add(__wsum csum, __wsum addend)
> +{
> +#ifdef __powerpc64__
> +	u64 res = (__force u64)csum;
> +
> +	res += (__force u64)addend;
> +	return (__force __wsum)((u32)res + (res >> 32));
> +#else
> +	asm("addc %0,%0,%1;"
> +	    "addze %0,%0;"
> +	    : "+r" (csum) : "r" (addend));
> +	return csum;
> +#endif

I'd have thought it better to test for the cpu type where you want the
'asm' variant, and then fall back on the C version for all others.
I know (well, suspect) there are only two cases here.

I'd also have thought that the 64bit C version above would be generally 'good'.

	David
On Fri, 2015-05-22 at 15:57 +0000, David Laight wrote:
> From: Linuxppc-dev Christophe Leroy
> > Sent: 19 May 2015 16:19
> ...
> > diff --git a/arch/powerpc/include/asm/checksum.h b/arch/powerpc/include/asm/checksum.h
> > index 5e43d2d..e8d9ef4 100644
> > --- a/arch/powerpc/include/asm/checksum.h
> > +++ b/arch/powerpc/include/asm/checksum.h
> > @@ -130,6 +130,22 @@ static inline __sum16 csum_tcpudp_magic(__be32 saddr, __be32 daddr,
> > 	return csum_fold(csum_tcpudp_nofold(saddr, daddr, len, proto, sum));
> > }
> >
> > +#define HAVE_ARCH_CSUM_ADD
> > +static inline __wsum csum_add(__wsum csum, __wsum addend)
> > +{
> > +#ifdef __powerpc64__
> > +	u64 res = (__force u64)csum;
> > +
> > +	res += (__force u64)addend;
> > +	return (__force __wsum)((u32)res + (res >> 32));
> > +#else
> > +	asm("addc %0,%0,%1;"
> > +	    "addze %0,%0;"
> > +	    : "+r" (csum) : "r" (addend));
> > +	return csum;
> > +#endif
>
> I'd have thought it better to test for the cpu type where you want the
> 'asm' variant, and then fall back on the C version for all others.
> I know (well suspect) there are only two cases here.

Usually it's more readable to see "if (x) ... else ..." than
"if (!x) ... else ...", and 64-bit is what has a symbol defined.

> I'd also have thought that the 64bit C version above would be generally 'good'.

It doesn't generate the addc/addze sequence.  At least with GCC 4.8.2,
it does something like:

	mr	tmp0, csum
	li	tmp1, 0
	li	tmp2, 0
	addc	tmp3, addend, tmp0
	adde	csum, tmp2, tmp1
	add	csum, csum, tmp3

-Scott
On Fri, May 22, 2015 at 02:32:42PM -0500, Scott Wood wrote:
> > I'd also have thought that the 64bit C version above would be generally 'good'.
>
> It doesn't generate the addc/addze sequence.  At least with GCC 4.8.2,
> it does something like:
>
> 	mr	tmp0, csum
> 	li	tmp1, 0
> 	li	tmp2, 0
> 	addc	tmp3, addend, tmp0
> 	adde	csum, tmp2, tmp1
> 	add	csum, csum, tmp3

Right.  Don't expect older compilers to do sane things here.

All this begs a question... If it is worth spending so much time
micro-optimising this, why not pick the low-hanging fruit first?
Having a 32-bit accumulator for ones' complement sums, on a 64-bit
system, is not such a great idea.


Segher
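Segher's point can be sketched in plain C (standalone types and hypothetical helper names, not kernel code): with a 64-bit accumulator, carries from the 32-bit (or 16-bit) additions simply pile up in the high bits and only need to be folded back in once at the end, instead of on every add.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fold a 64-bit ones' complement accumulator down to 16 bits. */
static uint16_t fold64(uint64_t sum)
{
	sum = (sum & 0xffffffffu) + (sum >> 32);  /* may still carry once */
	sum = (sum & 0xffffffffu) + (sum >> 32);
	sum = (sum & 0xffffu) + (sum >> 16);
	sum = (sum & 0xffffu) + (sum >> 16);
	return (uint16_t)sum;
}

/* Ones' complement sum of 16-bit words: the 64-bit accumulator cannot
 * overflow until ~2^48 words, so the loop body is a plain add with no
 * per-iteration carry handling. */
static uint16_t csum16(const uint16_t *p, size_t n)
{
	uint64_t sum = 0;

	while (n--)
		sum += *p++;
	return fold64(sum);
}
```

The per-add carry propagation (the addc/addze pair being discussed) disappears from the inner loop entirely; that is the "low-hanging fruit".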
On Fri, 2015-05-22 at 16:39 -0500, Segher Boessenkool wrote:
> On Fri, May 22, 2015 at 02:32:42PM -0500, Scott Wood wrote:
> > > I'd also have thought that the 64bit C version above would be generally 'good'.
> >
> > It doesn't generate the addc/addze sequence.  At least with GCC 4.8.2,
> > it does something like:
> >
> > 	mr	tmp0, csum
> > 	li	tmp1, 0
> > 	li	tmp2, 0
> > 	addc	tmp3, addend, tmp0
> > 	adde	csum, tmp2, tmp1
> > 	add	csum, csum, tmp3
>
> Right.  Don't expect older compilers to do sane things here.
>
> All this begs a question... If it is worth spending so much time
> micro-optimising this, why not pick the low-hanging fruit first?
> Having a 32-bit accumulator for ones' complement sums, on a 64-bit
> system, is not such a great idea.

That would be a more intrusive change -- not (comparatively)
low-hanging fruit.  Plus, the person submitting these patches is
focused on 32-bit.

-Scott
From: Scott Wood
...
> > I'd also have thought that the 64bit C version above would be generally 'good'.
>
> It doesn't generate the addc/addze sequence.  At least with GCC 4.8.2,
> it does something like:
>
> 	mr	tmp0, csum
> 	li	tmp1, 0
> 	li	tmp2, 0
> 	addc	tmp3, addend, tmp0
> 	adde	csum, tmp2, tmp1
> 	add	csum, csum, tmp3

I was thinking of all 64bit targets, not 32bit ones.

	David
On Tue, 2015-05-26 at 13:57 +0000, David Laight wrote:
> From: Scott Wood
> ...
> > > I'd also have thought that the 64bit C version above would be
> > > generally 'good'.
> >
> > It doesn't generate the addc/addze sequence.  At least with GCC 4.8.2,
> > it does something like:
> >
> > 	mr	tmp0, csum
> > 	li	tmp1, 0
> > 	li	tmp2, 0
> > 	addc	tmp3, addend, tmp0
> > 	adde	csum, tmp2, tmp1
> > 	add	csum, csum, tmp3
>
> I was thinking of all 64bit targets, not 32bit ones.

Oh, you mean move it out of arch/powerpc?  Sounds reasonable, but
someone should probably check what the resulting code looks like on
other common arches.  OTOH, if we're going to modify non-arch code,
that might be a good opportunity to implement Segher's suggestion and
move to a 64-bit accumulator.

-Scott
From: Scott Wood
> Sent: 26 May 2015 20:43
...
> > I was thinking of all 64bit targets, not 32bit ones.
>
> Oh, you mean move it out of arch/powerpc?  Sounds reasonable, but
> someone should probably check what the resulting code looks like on
> other common arches.  OTOH, if we're going to modify non-arch code,
> that might be a good opportunity to implement Segher's suggestion and
> move to a 64-bit accumulator.

Or more likely: adding alternate 32bit words to different 64-bit
registers in order to reduce the dependency chains further.
I'm sure I've seen a version that does that somewhere.

	David
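The interleaving David describes can be sketched like this (a hypothetical illustration, not whichever version he recalls seeing): alternate 32-bit words go into two independent 64-bit accumulators, so the two adds in each loop iteration have no data dependency on each other and can issue in parallel on a superscalar core.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Ones' complement sum of 32-bit words using two independent 64-bit
 * accumulators; the two adds in the loop body form separate
 * dependency chains. */
static uint32_t csum32_2acc(const uint32_t *p, size_t n)
{
	uint64_t a = 0, b = 0;
	size_t i = 0;

	for (; i + 1 < n; i += 2) {
		a += p[i];      /* independent of ... */
		b += p[i + 1];  /* ... this add */
	}
	if (i < n)
		a += p[i];      /* odd tail word */

	a += b;                             /* combine accumulators */
	a = (a & 0xffffffffu) + (a >> 32);  /* fold carries back in */
	a = (a & 0xffffffffu) + (a >> 32);
	return (uint32_t)a;
}
```

More accumulators shorten the chains further, at the cost of a slightly longer combine/fold tail; the sweet spot depends on the core's issue width.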
diff --git a/arch/powerpc/include/asm/checksum.h b/arch/powerpc/include/asm/checksum.h
index 5e43d2d..e8d9ef4 100644
--- a/arch/powerpc/include/asm/checksum.h
+++ b/arch/powerpc/include/asm/checksum.h
@@ -130,6 +130,22 @@ static inline __sum16 csum_tcpudp_magic(__be32 saddr, __be32 daddr,
 	return csum_fold(csum_tcpudp_nofold(saddr, daddr, len, proto, sum));
 }
 
+#define HAVE_ARCH_CSUM_ADD
+static inline __wsum csum_add(__wsum csum, __wsum addend)
+{
+#ifdef __powerpc64__
+	u64 res = (__force u64)csum;
+
+	res += (__force u64)addend;
+	return (__force __wsum)((u32)res + (res >> 32));
+#else
+	asm("addc %0,%0,%1;"
+	    "addze %0,%0;"
+	    : "+r" (csum) : "r" (addend));
+	return csum;
+#endif
+}
+
 #endif
 #endif /* __KERNEL__ */
 #endif
The C version of csum_add() as defined in include/net/checksum.h gives
the following assembly in ppc32:

       0:	7c 04 1a 14	add     r0,r4,r3
       4:	7c 64 00 10	subfc   r3,r4,r0
       8:	7c 63 19 10	subfe   r3,r3,r3
       c:	7c 63 00 50	subf    r3,r3,r0

and the following in ppc64:

   0xc000000000001af8 <+0>:	add     r3,r3,r4
   0xc000000000001afc <+4>:	cmplw   cr7,r3,r4
   0xc000000000001b00 <+8>:	mfcr    r4
   0xc000000000001b04 <+12>:	rlwinm  r4,r4,29,31,31
   0xc000000000001b08 <+16>:	add     r3,r4,r3
   0xc000000000001b0c <+20>:	clrldi  r3,r3,32
   0xc000000000001b10 <+24>:	blr

include/net/checksum.h also offers the possibility to define an
arch-specific function.  This patch provides a specific csum_add()
inline function.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
---
 arch/powerpc/include/asm/checksum.h | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)
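The semantics the patch implements—32-bit addition with end-around carry—can be checked in plain C (a sketch with standalone types and hypothetical helper names, not the kernel's __wsum code). The first variant detects the carry by comparison, as a generic C implementation must; the second is the 64-bit widening trick from the patch's ppc64 branch. Both must agree for all inputs.

```c
#include <assert.h>
#include <stdint.h>

/* 32-bit ones' complement add: a carry out of bit 31 is added back
 * into bit 0, mirroring what the addc/addze pair does on ppc32.
 * An unsigned add wrapped iff the result is smaller than an operand,
 * and adding the carry back cannot wrap again (res <= 0xfffffffe). */
static uint32_t csum_add32(uint32_t csum, uint32_t addend)
{
	uint32_t res = csum + addend;

	return res + (res < addend);
}

/* The widening variant used by the patch's ppc64 branch: do the add
 * in 64 bits, then fold the carry (bit 32) back into the low word. */
static uint32_t csum_add64(uint32_t csum, uint32_t addend)
{
	uint64_t res = (uint64_t)csum + addend;

	return (uint32_t)res + (uint32_t)(res >> 32);
}
```

The ppc32 asm exists precisely because, as the thread shows, compilers of the day turned neither form into the two-instruction addc/addze sequence.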