Message ID | 1d1362c8aa696e316d3ba97dce2342df6f6ee6cf.1432047904.git.christophe.leroy@c-s.fr (mailing list archive)
---|---
State | Accepted
Delegated to: | Scott Wood
From: Linuxppc-dev Christophe Leroy
> Sent: 19 May 2015 16:19
...
> diff --git a/arch/powerpc/include/asm/checksum.h b/arch/powerpc/include/asm/checksum.h
> index 5e43d2d..e8d9ef4 100644
> --- a/arch/powerpc/include/asm/checksum.h
> +++ b/arch/powerpc/include/asm/checksum.h
> @@ -130,6 +130,22 @@ static inline __sum16 csum_tcpudp_magic(__be32 saddr, __be32 daddr,
> 	return csum_fold(csum_tcpudp_nofold(saddr, daddr, len, proto, sum));
> }
>
> +#define HAVE_ARCH_CSUM_ADD
> +static inline __wsum csum_add(__wsum csum, __wsum addend)
> +{
> +#ifdef __powerpc64__
> +	u64 res = (__force u64)csum;
> +
> +	res += (__force u64)addend;
> +	return (__force __wsum)((u32)res + (res >> 32));
> +#else
> +	asm("addc %0,%0,%1;"
> +	    "addze %0,%0;"
> +	    : "+r" (csum) : "r" (addend));
> +	return csum;
> +#endif

I'd have thought it better to test for the cpu type where you want the
'asm' variant, and then fall back on the C version for all others.
I know (well, suspect) there are only two cases here.

I'd also have thought that the 64bit C version above would be generally 'good'.

	David
On Fri, 2015-05-22 at 15:57 +0000, David Laight wrote:
> From: Linuxppc-dev Christophe Leroy
> > Sent: 19 May 2015 16:19
> ...
> > diff --git a/arch/powerpc/include/asm/checksum.h b/arch/powerpc/include/asm/checksum.h
> > index 5e43d2d..e8d9ef4 100644
> > --- a/arch/powerpc/include/asm/checksum.h
> > +++ b/arch/powerpc/include/asm/checksum.h
> > @@ -130,6 +130,22 @@ static inline __sum16 csum_tcpudp_magic(__be32 saddr, __be32 daddr,
> > 	return csum_fold(csum_tcpudp_nofold(saddr, daddr, len, proto, sum));
> > }
> >
> > +#define HAVE_ARCH_CSUM_ADD
> > +static inline __wsum csum_add(__wsum csum, __wsum addend)
> > +{
> > +#ifdef __powerpc64__
> > +	u64 res = (__force u64)csum;
> > +
> > +	res += (__force u64)addend;
> > +	return (__force __wsum)((u32)res + (res >> 32));
> > +#else
> > +	asm("addc %0,%0,%1;"
> > +	    "addze %0,%0;"
> > +	    : "+r" (csum) : "r" (addend));
> > +	return csum;
> > +#endif
>
> I'd have thought it better to test for the cpu type where you want the
> 'asm' variant, and then fall back on the C version for all others.
> I know (well suspect) there are only two cases here.

Usually it's more readable to see "if (x) ... else ..." than
"if (!x) ... else ...", and 64-bit is what has a symbol defined.

> I'd also have thought that the 64bit C version above would be generally 'good'.

It doesn't generate the addc/addze sequence.  At least with GCC 4.8.2,
it does something like:

	mr	tmp0, csum
	li	tmp1, 0
	li	tmp2, 0
	addc	tmp3, addend, tmp0
	adde	csum, tmp2, tmp1
	add	csum, csum, tmp3

-Scott
On Fri, May 22, 2015 at 02:32:42PM -0500, Scott Wood wrote:
> > I'd also have thought that the 64bit C version above would be generally 'good'.
>
> It doesn't generate the addc/addze sequence.  At least with GCC 4.8.2,
> it does something like:
>
> 	mr	tmp0, csum
> 	li	tmp1, 0
> 	li	tmp2, 0
> 	addc	tmp3, addend, tmp0
> 	adde	csum, tmp2, tmp1
> 	add	csum, csum, tmp3

Right.  Don't expect older compilers to do sane things here.

All this begs a question... If it is worth spending so much time
micro-optimising this, why not pick the low-hanging fruit first?
Having a 32-bit accumulator for ones' complement sums, on a 64-bit
system, is not such a great idea.


Segher
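Segher's point can be sketched in plain C (standalone types and hypothetical helper names, not kernel code): with a 64-bit accumulator, carries from the 32-bit (or 16-bit) additions simply pile up in the high bits and only need to be folded back in once at the end, instead of on every add.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fold a 64-bit ones' complement accumulator down to 16 bits. */
static uint16_t fold64(uint64_t sum)
{
	sum = (sum & 0xffffffffu) + (sum >> 32);  /* may still carry once */
	sum = (sum & 0xffffffffu) + (sum >> 32);
	sum = (sum & 0xffffu) + (sum >> 16);
	sum = (sum & 0xffffu) + (sum >> 16);
	return (uint16_t)sum;
}

/* Ones' complement sum of 16-bit words: the 64-bit accumulator cannot
 * overflow until ~2^48 words, so the loop body is a plain add with no
 * per-iteration carry handling. */
static uint16_t csum16(const uint16_t *p, size_t n)
{
	uint64_t sum = 0;

	while (n--)
		sum += *p++;
	return fold64(sum);
}
```

The per-add carry propagation (the addc/addze pair being discussed) disappears from the inner loop entirely; that is the "low-hanging fruit".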
On Fri, 2015-05-22 at 16:39 -0500, Segher Boessenkool wrote:
> On Fri, May 22, 2015 at 02:32:42PM -0500, Scott Wood wrote:
> > > I'd also have thought that the 64bit C version above would be generally 'good'.
> >
> > It doesn't generate the addc/addze sequence.  At least with GCC 4.8.2,
> > it does something like:
> >
> > 	mr	tmp0, csum
> > 	li	tmp1, 0
> > 	li	tmp2, 0
> > 	addc	tmp3, addend, tmp0
> > 	adde	csum, tmp2, tmp1
> > 	add	csum, csum, tmp3
>
> Right.  Don't expect older compilers to do sane things here.
>
> All this begs a question... If it is worth spending so much time
> micro-optimising this, why not pick the low-hanging fruit first?
> Having a 32-bit accumulator for ones' complement sums, on a 64-bit
> system, is not such a great idea.

That would be a more intrusive change -- not (comparatively)
low-hanging fruit.  Plus, the person submitting these patches is
focused on 32-bit.

-Scott
From: Scott Wood
...
> > I'd also have thought that the 64bit C version above would be generally 'good'.
>
> It doesn't generate the addc/addze sequence.  At least with GCC 4.8.2,
> it does something like:
>
> 	mr	tmp0, csum
> 	li	tmp1, 0
> 	li	tmp2, 0
> 	addc	tmp3, addend, tmp0
> 	adde	csum, tmp2, tmp1
> 	add	csum, csum, tmp3

I was thinking of all 64bit targets, not 32bit ones.

	David
On Tue, 2015-05-26 at 13:57 +0000, David Laight wrote:
> From: Scott Wood
> ...
> > > I'd also have thought that the 64bit C version above would be
> > > generally 'good'.
> >
> > It doesn't generate the addc/addze sequence.  At least with GCC 4.8.2,
> > it does something like:
> >
> > 	mr	tmp0, csum
> > 	li	tmp1, 0
> > 	li	tmp2, 0
> > 	addc	tmp3, addend, tmp0
> > 	adde	csum, tmp2, tmp1
> > 	add	csum, csum, tmp3
>
> I was thinking of all 64bit targets, not 32bit ones.

Oh, you mean move it out of arch/powerpc?  Sounds reasonable, but
someone should probably check what the resulting code looks like on
other common arches.  OTOH, if we're going to modify non-arch code,
that might be a good opportunity to implement Segher's suggestion and
move to a 64-bit accumulator.

-Scott
From: Scott Wood
> Sent: 26 May 2015 20:43
...
> > I was thinking of all 64bit targets, not 32bit ones.
>
> Oh, you mean move it out of arch/powerpc?  Sounds reasonable, but
> someone should probably check what the resulting code looks like on
> other common arches.  OTOH, if we're going to modify non-arch code,
> that might be a good opportunity to implement Segher's suggestion and
> move to a 64-bit accumulator.

Or more likely: adding alternate 32bit words to different 64-bit
registers in order to reduce the dependency chains further.
I'm sure I've seen a version that does that somewhere.

	David
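The interleaving David describes can be sketched like this (a hypothetical illustration, not whichever version he recalls seeing): alternate 32-bit words go into two independent 64-bit accumulators, so the two adds in each loop iteration have no data dependency on each other and can issue in parallel on a superscalar core.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Ones' complement sum of 32-bit words using two independent 64-bit
 * accumulators; the two adds in the loop body form separate
 * dependency chains. */
static uint32_t csum32_2acc(const uint32_t *p, size_t n)
{
	uint64_t a = 0, b = 0;
	size_t i = 0;

	for (; i + 1 < n; i += 2) {
		a += p[i];      /* independent of ... */
		b += p[i + 1];  /* ... this add */
	}
	if (i < n)
		a += p[i];      /* odd tail word */

	a += b;                             /* combine accumulators */
	a = (a & 0xffffffffu) + (a >> 32);  /* fold carries back in */
	a = (a & 0xffffffffu) + (a >> 32);
	return (uint32_t)a;
}
```

More accumulators shorten the chains further, at the cost of a slightly longer combine/fold tail; the sweet spot depends on the core's issue width.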
diff --git a/arch/powerpc/include/asm/checksum.h b/arch/powerpc/include/asm/checksum.h
index 5e43d2d..e8d9ef4 100644
--- a/arch/powerpc/include/asm/checksum.h
+++ b/arch/powerpc/include/asm/checksum.h
@@ -130,6 +130,22 @@ static inline __sum16 csum_tcpudp_magic(__be32 saddr, __be32 daddr,
 	return csum_fold(csum_tcpudp_nofold(saddr, daddr, len, proto, sum));
 }
 
+#define HAVE_ARCH_CSUM_ADD
+static inline __wsum csum_add(__wsum csum, __wsum addend)
+{
+#ifdef __powerpc64__
+	u64 res = (__force u64)csum;
+
+	res += (__force u64)addend;
+	return (__force __wsum)((u32)res + (res >> 32));
+#else
+	asm("addc %0,%0,%1;"
+	    "addze %0,%0;"
+	    : "+r" (csum) : "r" (addend));
+	return csum;
+#endif
+}
+
 #endif
 #endif /* __KERNEL__ */
 #endif
The C version of csum_add() as defined in include/net/checksum.h gives
the following assembly in ppc32:

       0:	7c 04 1a 14	add     r0,r4,r3
       4:	7c 64 00 10	subfc   r3,r4,r0
       8:	7c 63 19 10	subfe   r3,r3,r3
       c:	7c 63 00 50	subf    r3,r3,r0

and the following in ppc64:

   0xc000000000001af8 <+0>:	add     r3,r3,r4
   0xc000000000001afc <+4>:	cmplw   cr7,r3,r4
   0xc000000000001b00 <+8>:	mfcr    r4
   0xc000000000001b04 <+12>:	rlwinm  r4,r4,29,31,31
   0xc000000000001b08 <+16>:	add     r3,r4,r3
   0xc000000000001b0c <+20>:	clrldi  r3,r3,32
   0xc000000000001b10 <+24>:	blr

include/net/checksum.h also offers the possibility to define an
arch-specific function.  This patch provides a specific csum_add()
inline function.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
---
 arch/powerpc/include/asm/checksum.h | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)
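The semantics the patch implements—32-bit addition with end-around carry—can be checked in plain C (a sketch with standalone types and hypothetical helper names, not the kernel's __wsum code). The first variant detects the carry by comparison, as a generic C implementation must; the second is the 64-bit widening trick from the patch's ppc64 branch. Both must agree for all inputs.

```c
#include <assert.h>
#include <stdint.h>

/* 32-bit ones' complement add: a carry out of bit 31 is added back
 * into bit 0, mirroring what the addc/addze pair does on ppc32.
 * An unsigned add wrapped iff the result is smaller than an operand,
 * and adding the carry back cannot wrap again (res <= 0xfffffffe). */
static uint32_t csum_add32(uint32_t csum, uint32_t addend)
{
	uint32_t res = csum + addend;

	return res + (res < addend);
}

/* The widening variant used by the patch's ppc64 branch: do the add
 * in 64 bits, then fold the carry (bit 32) back into the low word. */
static uint32_t csum_add64(uint32_t csum, uint32_t addend)
{
	uint64_t res = (uint64_t)csum + addend;

	return (uint32_t)res + (uint32_t)(res >> 32);
}
```

The ppc32 asm exists precisely because, as the thread shows, compilers of the day turned neither form into the two-instruction addc/addze sequence.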