
[MIPS] Modify memset.S for mips32r6/mips64r6

Message ID 2923c970-026c-4e00-be7a-0650e82421b5@BAMAIL02.ba.imgtec.org
State New

Commit Message

Steve Ellcey Dec. 19, 2014, 11:26 p.m. UTC
Here is the last of my patches for mips32r6/mips64r6 support.  It updates
memset to use byte copies instead of swl or sdl to align the destination
because those instructions are not supported in mips32r6 or mips64r6.
It also avoids using the 'prepare for store' prefetch hint because that
is not supported on mips32r6 or mips64r6 either.
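
In C terms, the new alignment step amounts to the following (a minimal
sketch with assumed names dst/c/n; the actual code uses a jump table of
sb instructions):

  size_t lead = (-(uintptr_t) dst) & 7; /* bytes until 8-byte alignment */
  for (size_t i = 0; i < lead; i++)     /* byte stores replace swl/sdl  */
    dst[i] = c;
  dst += lead;
  n -= lead;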

Tested with the mips32r6/mips64r6 GCC, binutils and qemu simulator.

OK to checkin?

Steve Ellcey
sellcey@imgtec.com


2014-12-19  Steve Ellcey  <sellcey@imgtec.com>

	* sysdeps/mips/memset.S (memset): Modify for mips32r6/mips64r6
	to avoid using swl/sdl to align destination.

Comments

Matthew Fortune Dec. 19, 2014, 11:56 p.m. UTC | #1
> +#if __mips_isa_rev > 5

I happened to be starting some mips2 tests with glibc
and found that __mips_isa_rev is not defined below mips32.
This is then an error, given that -Werror is in force.

i.e. I think we will need a defined(__mips_isa_rev) check in any of the
conditions that use it. This is likely to affect all of the R6
patches.
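
Something like this, as a sketch:

#if defined (__mips_isa_rev) && __mips_isa_rev > 5
# define R6_CODE
#endif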

Thanks,
Matthew
Joseph Myers Dec. 20, 2014, 12:23 a.m. UTC | #2
On Fri, 19 Dec 2014, Matthew Fortune wrote:

> > +#if __mips_isa_rev > 5
> 
> I happened to be starting some mips2 tests with glibc
> and found that __mips_isa_rev is not defined below mips32.
> This is then an error, given that -Werror is in force.

We still have -Wno-error=undef, so it *shouldn't* be an error (but once 
Siddhesh's remaining -Wundef fixes are in, and any further fixes needed to 
get clean for -Wundef in at least one configuration, we should remove 
-Wno-error=undef, so such issues certainly need fixing soon).

> i.e. I think we will need a defined(__mips_isa_rev) check in any of the
> conditions that use it. This is likely to affect all of the R6
> patches.

Please put this in a central place (a single header that defines 
__mips_isa_rev to 0 if not already defined, for example) rather than 
duplicating "defined" tests in different places ("defined" tests are 
typo-prone and so discouraged in glibc).
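
For example (a sketch; where the header lives is up to the patch author):

#ifndef __mips_isa_rev
# define __mips_isa_rev 0
#endif

Plain "#if __mips_isa_rev > 5" tests are then well-defined everywhere.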
Matthew Fortune Dec. 20, 2014, 7:39 a.m. UTC | #3
Joseph Myers <joseph@codesourcery.com> writes:
> On Fri, 19 Dec 2014, Matthew Fortune wrote:
> 
> > > +#if __mips_isa_rev > 5
> >
> > I happened to be starting some mips2 tests with glibc and
> > found that __mips_isa_rev is not defined below mips32.
> > This is then an error, given that -Werror is in force.
> 
> We still have -Wno-error=undef, so it *shouldn't* be an error (but once
> Siddhesh's remaining -Wundef fixes are in, and any further fixes needed
> to get clean for -Wundef in at least one configuration, we should remove
> -Wno-error=undef, so such issues certainly need fixing soon).

Ah, OK. I think my build must have failed for a different reason
and there was a __mips_isa_rev warning close to the end that I
misread as an error.

I guess we can leave it to a cleanup commit for all the instances
then.

Matthew
Ondřej Bílka Dec. 20, 2014, 9:09 a.m. UTC | #4
On Fri, Dec 19, 2014 at 03:26:44PM -0800, Steve Ellcey  wrote:
> Here is the last of my patches for mips32r6/mips64r6 support.  It updates
> memset to use byte copies instead of swl or sdl to align the destination
> because those instructions are not supported in mips32r6 or mips64r6.
> It also avoids using the 'prepare for store' prefetch hint because that
> is not supported on mips32r6 or mips64r6 either.
> 
> Tested with the mips32r6/mips64r6 GCC, binutils and qemu simulator.
> 
> OK to checkin?
> 
> Steve Ellcey
> sellcey@imgtec.com
> 
> 
>  	PTR_ADDU a0,a0,t2
> +#else /* R6_CODE */
> +	andi	t2,a0,7
> +	lapc	t9,L(atable)
> +	PTR_LSA	t9,t2,t9,2
> +	jrc	t9
> +L(atable):
> +	bc	L(aligned)
> +	bc	L(lb7)

That could be a performance regression; test whether it is faster than
the existing loop on unpredictable branches [B].

Also try whether plain branches are better, as in the following C code [A].

A table lookup could be even slower in real workloads, as it adds latency
when the table is not in cache.

From a practical standpoint the realignment code looks like dead code: on
x86-64, 83% of calls are 16-byte aligned, and I cannot find an application
that makes calls unaligned to 8 bytes.

You would get a better speedup by adding a check for whether the
destination is already aligned and moving the realignment code to the
bottom of the file to improve instruction-cache usage.

[A]

/* x: char * destination; mask: the fill byte replicated to word width.  */
if ((uintptr_t) x & 1)
  {
    *x = mask;                  /* one-byte store, then step past it */
    x += 1;
  }

if ((uintptr_t) x & 2)
  {
    *((uint16_t *) x) = mask;   /* two-byte store */
    x += 2;
  }

if ((uintptr_t) x & 4)
  {
    *((uint32_t *) x) = mask;   /* four-byte store */
    x += 4;
  }
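
The already-aligned fast path mentioned above could be just (sketch):

if (((uintptr_t) x & 7) == 0)
  goto aligned;  /* common case: skip the realignment code entirely */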



[B]

#include <string.h>

int main (void)
{
  char foo[100];
  int i; 
  for (i = 0; i < 100000000; i++)
    memset (foo + (i % 16), 1, 32 - (i % 16));
        
  return foo[17];
}
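
(A rough harness: build with e.g. gcc -O2 -fno-builtin so the library
memset is actually called, and time the jump-table and plain-branch
variants under this loop; exact methodology is up to the implementer.)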
Joseph Myers Dec. 20, 2014, 11:12 a.m. UTC | #5
On Sat, 20 Dec 2014, Matthew Fortune wrote:

> Joseph Myers <joseph@codesourcery.com> writes:
> > On Fri, 19 Dec 2014, Matthew Fortune wrote:
> > 
> > > > +#if __mips_isa_rev > 5
> > >
> > > I happened to be starting some mips2 tests with glibc and
> > > found that __mips_isa_rev is not defined below mips32.
> > > This is then an error, given that -Werror is in force.
> > 
> > We still have -Wno-error=undef, so it *shouldn't* be an error (but once
> > Siddhesh's remaining -Wundef fixes are in, and any further fixes needed
> > to get clean for -Wundef in at least one configuration, we should remove
> > -Wno-error=undef, so such issues certainly need fixing soon).
> 
> Ah, OK. I think my build must have failed for a different reason
> and there was a __mips_isa_rev warning close to the end that I
> misread as an error.
> 
> I guess we can leave it to a cleanup commit for all the instances
> then.

I'll wait for the cleanup and revised versions of the memset / memcpy 
patches (on the basis that the identified -Wundef issue means that the 
memset / memcpy patches can't go in as-is).
Joseph Myers Dec. 22, 2014, 6:02 p.m. UTC | #6
On Fri, 19 Dec 2014, Steve Ellcey  wrote:

> @@ -188,9 +205,9 @@ LEAF(MEMSET_NAME)
>  
>  	.set	nomips16
>  	.set	noreorder
> -/* If the size is less than 2*NSIZE (8 or 16), go to L(lastb).  Regardless of
> +/* If the size is less than 4*NSIZE (16 or 32), go to L(lastb).  Regardless of
>     size, copy dst pointer to v0 for the return value.  */
> -	slti	t2,a2,(2 * NSIZE)
> +	slti	t2,a2,(4 * NSIZE)
>  	bne	t2,zero,L(lastb)
>  	move	v0,a0
>  

This doesn't appear to have anything obvious to do with r6.  Please submit 
it separately from the r6 changes, with its own rationale (or if it is in 
fact needed as part of the r6 changes, please explain further).

In general, r6 changes should not result in any changes to the code in 
glibc built for non-r6 - if the best way to support r6 also involves 
changes to code for non-r6 (e.g. if some cleanup or optimization naturally 
results in r6 support because the non-r6-compatible code was bad in some 
way and r6-compatible code is better for all versions, or if code 
compatible with both r6 and pre-r6 can be just as good on all versions as 
the existing non-r6-compatible code) then it's best to submit those 
changes separately.
Steve Ellcey Dec. 22, 2014, 7:20 p.m. UTC | #7
On Mon, 2014-12-22 at 18:02 +0000, Joseph Myers wrote:
> On Fri, 19 Dec 2014, Steve Ellcey  wrote:
> 
> > @@ -188,9 +205,9 @@ LEAF(MEMSET_NAME)
> >  
> >  	.set	nomips16
> >  	.set	noreorder
> > -/* If the size is less than 2*NSIZE (8 or 16), go to L(lastb).  Regardless of
> > +/* If the size is less than 4*NSIZE (16 or 32), go to L(lastb).  Regardless of
> >     size, copy dst pointer to v0 for the return value.  */
> > -	slti	t2,a2,(2 * NSIZE)
> > +	slti	t2,a2,(4 * NSIZE)
> >  	bne	t2,zero,L(lastb)
> >  	move	v0,a0
> >  
> 
> This doesn't appear to have anything obvious to do with r6.  Please submit 
> it separately from the r6 changes, with its own rationale (or if it is in 
> fact needed as part of the r6 changes, please explain further).
> 
> In general, r6 changes should not result in any changes to the code in 
> glibc built for non-r6 - if the best way to support r6 also involves 
> changes to code for non-r6 (e.g. if some cleanup or optimization naturally 
> results in r6 support because the non-r6-compatible code was bad in some 
> way and r6-compatible code is better for all versions, or if code 
> compatible with both r6 and pre-r6 can be just as good on all versions as 
> the existing non-r6-compatible code) then it's best to submit those 
> changes separately.

The change is needed for r6 because I now align the buffer to 8 bytes,
instead of to 4 bytes in 32-bit mode and 8 bytes in 64-bit mode.  This
should help with cache alignment and, depending on the CPU, with memory
bonding, where two sequential 4-byte loads can be handled as if they were
a single 8-byte load.

Without the slti change (assuming O32 mode), if I have a non-aligned 9-byte
buffer I could set up to 7 bytes to get it 8-byte aligned, and that leaves
less than a full 4-byte word to be set.  The existing code after this check
assumes there is at least one full word that needs to be set.
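
To make that concrete (illustrative numbers, O32 so NSIZE == 4):

  n = 9; dst % 8 == 1;  /* passes the old test: 9 >= 2*NSIZE == 8      */
  lead = 8 - 1;         /* 7 bytes stored to reach 8-byte alignment    */
  n - lead == 2;        /* < NSIZE: no full word left for the loop     */
                        /* new test: 9 < 4*NSIZE == 16 -> L(lastb)     */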

I considered changing the alignment code to only align on a 4-byte
boundary for O32 mode, or ifdef'ing this test, but it seemed cleaner to
increase the minimum size of buffers that get handled via a simple
byte-copy loop for both r6 and earlier CPUs.

If you don't like this part of the patch I could go back to aligning on
a 4 byte boundary in 32 bit mode for now and get some performance data
on the different alignment options later.

Steve Ellcey
sellcey@imgtec.com
Joseph Myers Dec. 22, 2014, 8:02 p.m. UTC | #8
On Mon, 22 Dec 2014, Steve Ellcey wrote:

> I considered changing the alignment code to only align on a 4-byte
> boundary for O32 mode, or ifdef'ing this test, but it seemed cleaner to
> increase the minimum size of buffers that get handled via a simple
> byte-copy loop for both r6 and earlier CPUs.

In that case, submit the change as a preparatory patch, with its own 
justification for being OK for existing CPUs (that it doesn't affect 
performance, or whatever), so that the r6 patch can avoid changing pre-r6 
code.

Patch

diff --git a/sysdeps/mips/memset.S b/sysdeps/mips/memset.S
index 9991480..4cc6dd7 100644
--- a/sysdeps/mips/memset.S
+++ b/sysdeps/mips/memset.S
@@ -54,6 +54,14 @@ 
 # endif
 #endif
 
+#if __mips_isa_rev > 5
+# if (PREFETCH_STORE_HINT == PREFETCH_HINT_PREPAREFORSTORE)
+#  undef PREFETCH_STORE_HINT
+#  define PREFETCH_STORE_HINT PREFETCH_HINT_STORE_STREAMED
+# endif
+# define R6_CODE
+#endif
+
 /* Some asm.h files do not have the L macro definition.  */
 #ifndef L
 # if _MIPS_SIM == _ABIO32
@@ -72,6 +80,15 @@ 
 # endif
 #endif
 
+/* New R6 instructions that may not be in asm.h.  */
+#ifndef PTR_LSA
+# if _MIPS_SIM == _ABI64
+#  define PTR_LSA        dlsa
+# else
+#  define PTR_LSA        lsa
+# endif
+#endif
+
 /* Using PREFETCH_HINT_PREPAREFORSTORE instead of PREFETCH_STORE
    or PREFETCH_STORE_STREAMED offers a large performance advantage
    but PREPAREFORSTORE has some special restrictions to consider.
@@ -188,9 +205,9 @@  LEAF(MEMSET_NAME)
 
 	.set	nomips16
 	.set	noreorder
-/* If the size is less than 2*NSIZE (8 or 16), go to L(lastb).  Regardless of
+/* If the size is less than 4*NSIZE (16 or 32), go to L(lastb).  Regardless of
    size, copy dst pointer to v0 for the return value.  */
-	slti	t2,a2,(2 * NSIZE)
+	slti	t2,a2,(4 * NSIZE)
 	bne	t2,zero,L(lastb)
 	move	v0,a0
 
@@ -231,11 +248,46 @@  LEAF(MEMSET_NAME)
 /* If the destination address is not aligned do a partial store to get it
    aligned.  If it is already aligned just jump to L(aligned).  */
 L(set0):
+#if !defined (R6_CODE)
 	andi	t2,a3,(NSIZE-1)		/* word-unaligned address?          */
 	beq	t2,zero,L(aligned)	/* t2 is the unalignment count      */
 	PTR_SUBU a2,a2,t2
 	C_STHI	a1,0(a0)
 	PTR_ADDU a0,a0,t2
+#else /* R6_CODE */
+	andi	t2,a0,7
+	lapc	t9,L(atable)
+	PTR_LSA	t9,t2,t9,2
+	jrc	t9
+L(atable):
+	bc	L(aligned)
+	bc	L(lb7)
+	bc	L(lb6)
+	bc	L(lb5)
+	bc	L(lb4)
+	bc	L(lb3)
+	bc	L(lb2)
+	bc	L(lb1)
+L(lb7):
+	sb	a1,6(a0)
+L(lb6):
+	sb	a1,5(a0)
+L(lb5):
+	sb	a1,4(a0)
+L(lb4):
+	sb	a1,3(a0)
+L(lb3):
+	sb	a1,2(a0)
+L(lb2):
+	sb	a1,1(a0)
+L(lb1):
+	sb	a1,0(a0)
+
+	li	t9,8
+	subu	t2,t9,t2
+	PTR_SUBU a2,a2,t2
+	PTR_ADDU a0,a0,t2
+#endif /* R6_CODE */
 
 L(aligned):
 /* If USE_DOUBLE is not set we may still want to align the data on a 16
@@ -286,8 +338,12 @@  L(loop16w):
 	bgtz	v1,L(skip_pref)
 	nop
 #endif
+#if defined(R6_CODE)
+	PREFETCH_FOR_STORE (2, a0)
+#else
 	PREFETCH_FOR_STORE (4, a0)
 	PREFETCH_FOR_STORE (5, a0)
+#endif
 L(skip_pref):
 	C_ST	a1,UNIT(0)(a0)
 	C_ST	a1,UNIT(1)(a0)