Message ID | 20120604175858.38dac554@kryten (mailing list archive) |
---|---|
State | Changes Requested, archived |
Hi,

On Mon, Jun 4, 2012 at 12:58 AM, Anton Blanchard <anton@samba.org> wrote:
>
> I blame Mikey for this. He elevated my slightly dubious testcase:
>
> # dd if=/dev/zero of=/dev/null bs=1M count=10000
>
> to benchmark status. And naturally we need to be number 1 at creating
> zeros. So lets improve __clear_user some more.
>
> As Paul suggests we can use dcbz for large lengths. This patch gets
> the destination 128 byte aligned then uses dcbz on whole cachelines.
>
> Before:
> 10485760000 bytes (10 GB) copied, 0.414744 s, 25.3 GB/s
>
> After:
> 10485760000 bytes (10 GB) copied, 0.268597 s, 39.0 GB/s
>
> 39 GB/s, a new record.
>
> Signed-off-by: Anton Blanchard <anton@samba.org>
> ---
>
> Index: linux-build/arch/powerpc/lib/string_64.S
> ===================================================================
> --- linux-build.orig/arch/powerpc/lib/string_64.S	2012-06-04 16:18:56.351604302 +1000
> +++ linux-build/arch/powerpc/lib/string_64.S	2012-06-04 16:47:10.538500871 +1000
> @@ -78,7 +78,7 @@ _GLOBAL(__clear_user)

[..]

> +15:
> +err2;	dcbz	r0,r3
> +	addi	r3,r3,128
> +	addi	r4,r4,-128
> +	bdnz	15b

This breaks architecture spec (and at least one implementation); cache
lines are not guaranteed to be 128 bytes.


-Olof
On Jun 4, 2012, at 8:12 AM, Olof Johansson wrote:

> Hi,
>
> On Mon, Jun 4, 2012 at 12:58 AM, Anton Blanchard <anton@samba.org> wrote:
>>
>> I blame Mikey for this. He elevated my slightly dubious testcase:
>>
>> # dd if=/dev/zero of=/dev/null bs=1M count=10000
>>
>> to benchmark status. And naturally we need to be number 1 at creating
>> zeros. So lets improve __clear_user some more.
>>
>> As Paul suggests we can use dcbz for large lengths. This patch gets
>> the destination 128 byte aligned then uses dcbz on whole cachelines.
>>
>> Before:
>> 10485760000 bytes (10 GB) copied, 0.414744 s, 25.3 GB/s
>>
>> After:
>> 10485760000 bytes (10 GB) copied, 0.268597 s, 39.0 GB/s
>>
>> 39 GB/s, a new record.
>>
>> Signed-off-by: Anton Blanchard <anton@samba.org>
>> ---
>>
>> Index: linux-build/arch/powerpc/lib/string_64.S
>> ===================================================================
>> --- linux-build.orig/arch/powerpc/lib/string_64.S	2012-06-04 16:18:56.351604302 +1000
>> +++ linux-build/arch/powerpc/lib/string_64.S	2012-06-04 16:47:10.538500871 +1000
>> @@ -78,7 +78,7 @@ _GLOBAL(__clear_user)
> [..]
>
>> +15:
>> +err2;	dcbz	r0,r3
>> +	addi	r3,r3,128
>> +	addi	r4,r4,-128
>> +	bdnz	15b
>
> This breaks architecture spec (and at least one implementation); cache
> lines are not guaranteed to be 128 bytes.

I'm guessing it breaks more than one (FSL 64-bit is 64-byte cache lines).

- k
Index: linux-build/arch/powerpc/lib/string_64.S
===================================================================
--- linux-build.orig/arch/powerpc/lib/string_64.S	2012-06-04 16:18:56.351604302 +1000
+++ linux-build/arch/powerpc/lib/string_64.S	2012-06-04 16:47:10.538500871 +1000
@@ -78,7 +78,7 @@ _GLOBAL(__clear_user)
 	blt	.Lshort_clear
 	mr	r8,r3
 	mtocrf	0x01,r6
-	clrldi	r6,r6,(64-3)
+	clrldi	r7,r6,(64-3)
 
 	/* Get the destination 8 byte aligned */
 	bf	cr7*4+3,1f
@@ -93,11 +93,16 @@ err1;	sth	r0,0(r3)
 err1;	stw	r0,0(r3)
 	addi	r3,r3,4
 
-3:	sub	r4,r4,r6
-	srdi	r6,r4,5
+3:	sub	r4,r4,r7
+	cmpdi	r4,32
+	cmpdi	cr1,r4,512
 	blt	.Lshort_clear
-	mtctr	r6
+	bgt	cr1,.Llong_clear
+
+.Lmedium_clear:
+	srdi	r7,r4,5
+	mtctr	r7
 
 	/* Do 32 byte chunks */
 4:
@@ -139,3 +144,52 @@ err1;	stb	r0,0(r3)
 
 10:	li	r3,0
 	blr
+
+.Llong_clear:
+	/* Destination is 8 byte aligned, need to get it 128 byte aligned */
+	bf	cr7*4+0,11f
+err2;	std	r0,0(r3)
+	addi	r3,r3,8
+	addi	r4,r4,-8
+
+11:	srdi	r6,r6,4
+	mtocrf	0x01,r6
+
+	bf	cr7*4+3,12f
+err2;	std	r0,0(r3)
+err2;	std	r0,8(r3)
+	addi	r3,r3,16
+	addi	r4,r4,-16
+
+12:	bf	cr7*4+2,13f
+err2;	std	r0,0(r3)
+err2;	std	r0,8(r3)
+err2;	std	r0,16(r3)
+err2;	std	r0,24(r3)
+	addi	r3,r3,32
+	addi	r4,r4,-32
+
+13:	bf	cr7*4+1,14f
+err2;	std	r0,0(r3)
+err2;	std	r0,8(r3)
+err2;	std	r0,16(r3)
+err2;	std	r0,24(r3)
+err2;	std	r0,32(r3)
+err2;	std	r0,40(r3)
+err2;	std	r0,48(r3)
+err2;	std	r0,56(r3)
+	addi	r3,r3,64
+	addi	r4,r4,-64
+
+14:	srdi	r6,r4,7
+	mtctr	r6
+
+15:
+err2;	dcbz	r0,r3
+	addi	r3,r3,128
+	addi	r4,r4,-128
+	bdnz	15b
+
+	cmpdi	r4,32
+	blt	.Lshort_clear
+	b	.Lmedium_clear
I blame Mikey for this. He elevated my slightly dubious testcase:

# dd if=/dev/zero of=/dev/null bs=1M count=10000

to benchmark status. And naturally we need to be number 1 at creating
zeros. So let's improve __clear_user some more.

As Paul suggests, we can use dcbz for large lengths. This patch gets
the destination 128-byte aligned, then uses dcbz on whole cachelines.

Before:
10485760000 bytes (10 GB) copied, 0.414744 s, 25.3 GB/s

After:
10485760000 bytes (10 GB) copied, 0.268597 s, 39.0 GB/s

39 GB/s, a new record.

Signed-off-by: Anton Blanchard <anton@samba.org>
---