[RS6000] improve builtin expansion of memcmp for p7

Message ID 1475788351.5970.6.camel@linux.vnet.ibm.com
State New

Commit Message

Aaron Sawdey Oct. 6, 2016, 9:12 p.m. UTC
I've improved the builtin memcmp expansion so it avoids a couple of 
things that p7 and previous processors don't like. Performance on
p7 is now never worse than glibc memcmp(). Bootstrap/regtest in progress
on power7 ppc64 BE. 

OK for trunk if testing passes?


gcc/ChangeLog:

2016-10-06  Aaron Sawdey  <acsawdey@linux.vnet.ibm.com>

	* config/rs6000/rs6000.h (TARGET_EFFICIENT_OVERLAPPING_UNALIGNED)
	Add macro to say we can efficiently handle overlapping unaligned
	loads.
	* config/rs6000/rs6000.c (expand_block_compare): Avoid generating
	poor code for processors older than p8.
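
The early bail-out described in the rs6000.c entry can be modelled outside the compiler. The following is only a sketch: `should_expand_inline` and its `efficient_overlap` parameter are hypothetical names standing in for the new test in `expand_block_compare` and for `TARGET_EFFICIENT_OVERLAPPING_UNALIGNED`. The real function performs other checks that are not modelled here.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the patch's early-exit test; this is not GCC code.
   efficient_overlap plays the role of
   TARGET_EFFICIENT_OVERLAPPING_UNALIGNED (true on p8 and newer).
   Returns true if the inline expansion would go ahead, false if the
   expander would give up and leave the call to the library memcmp.  */
static bool
should_expand_inline (bool efficient_overlap, int base_align, int bytes)
{
  assert (base_align >= 1);

  /* Mirrors the existing check in the patch context: nothing to compare.  */
  if (bytes <= 0)
    return true;

  /* On p7 and older, small alignment combined with a longer length is
     where the inline code stops beating glibc, so bail out.  */
  if (!efficient_overlap
      && ((base_align == 1 && bytes > 16)
          || (base_align == 2 && bytes > 32)))
    return false;

  return true;
}
```

Under this model, a byte-aligned 17-byte compare is left to the library on p7 but still expanded inline on p8, while a compare with alignment 4 or more is never rejected by this particular test.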

Comments

Segher Boessenkool Oct. 6, 2016, 9:40 p.m. UTC | #1
Hi Aaron,

On Thu, Oct 06, 2016 at 04:12:31PM -0500, Aaron Sawdey wrote:
> I've improved the builtin memcmp expansion so it avoids a couple of 
> things that p7 and previous processors don't like. Performance on
> p7 is now never worse than glibc memcmp(). Bootstrap/regtest in progress
> on power7 ppc64 BE. 
> 
> OK for trunk if testing passes?

Okay, thanks.  Just a few formatting nits...


> 2016-10-06  Aaron Sawdey  <acsawdey@linux.vnet.ibm.com>
> 
> 	* config/rs6000/rs6000.h (TARGET_EFFICIENT_OVERLAPPING_UNALIGNED)

Needs a colon at the end of line here.

> 	Add macro to say we can efficiently handle overlapping unaligned
> 	loads.


> @@ -18736,13 +18744,18 @@
>    while (bytes > 0)
>      {
>        int align = compute_current_alignment (base_align, offset);
> -      load_mode = select_block_compare_mode(offset, bytes, align, word_mode_ok);
> +      if (TARGET_EFFICIENT_OVERLAPPING_UNALIGNED)
> +	load_mode = select_block_compare_mode(offset, bytes, align,
> +					      word_mode_ok);

Space before paren.

Thanks,


Segher
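
For readers following the loop change discussed above, here is a rough, portable model of the two ways to compare a tail shorter than the load size. It is a sketch under assumptions, not the RTL the expander emits: the helpers `load_be64`, `tail_overlapping` and `tail_shift` are made-up names, an 8-byte load stands in for `word_mode`, and the caller is assumed to guarantee that every byte read is accessible.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read 8 bytes as a big-endian integer, so that unsigned integer order
   matches memcmp order.  Stand-in for a doubleword load.  */
static uint64_t
load_be64 (const unsigned char *p)
{
  uint64_t v = 0;
  for (int i = 0; i < 8; i++)
    v = (v << 8) | p[i];
  return v;
}

/* -1, 0 or 1, like the sign of a memcmp result.  */
static int
sign_of_compare (uint64_t a, uint64_t b)
{
  return (a > b) - (a < b);
}

/* Overlapping strategy (cheap on P8/P9 per the patch): move the load back
   so it ends exactly at the end of the block.  The re-read bytes were
   already found equal by earlier loads, so they do not affect the result.
   Needs at least 8 - bytes bytes before offset.  */
static int
tail_overlapping (const unsigned char *s1, const unsigned char *s2,
                  size_t offset, size_t bytes)
{
  size_t extra = 8 - bytes;
  assert (bytes >= 1 && bytes < 8 && extra <= offset);
  return sign_of_compare (load_be64 (s1 + offset - extra),
                          load_be64 (s2 + offset - extra));
}

/* Shift strategy (used for P7 and earlier per the patch): load at offset
   without overlap, then shift the bytes beyond the tail off the low end.
   Reads 8 - bytes bytes past the tail; assumed readable here.  */
static int
tail_shift (const unsigned char *s1, const unsigned char *s2,
            size_t offset, size_t bytes)
{
  assert (bytes >= 1 && bytes < 8);
  unsigned shift = 8 * (8 - (unsigned) bytes);
  return sign_of_compare (load_be64 (s1 + offset) >> shift,
                          load_be64 (s2 + offset) >> shift);
}
```

Both helpers give the same answer whenever the bytes before `offset` are equal in the two buffers, which is the situation the expander is in when it reaches the tail. They differ only in which extra bytes the load touches.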

Patch

Index: gcc/config/rs6000/rs6000.c
===================================================================
--- gcc/config/rs6000/rs6000.c	(revision 240816)
+++ gcc/config/rs6000/rs6000.c	(working copy)
@@ -18687,6 +18687,14 @@ 
   if (bytes <= 0)
     return true;
 
+  /* The code generated for p7 and older is not faster than glibc
+     memcmp if alignment is small and length is not short, so bail
+     out to avoid those conditions.  */
+  if (!TARGET_EFFICIENT_OVERLAPPING_UNALIGNED
+      && ((base_align == 1 && bytes > 16)
+	  || (base_align == 2 && bytes > 32)))
+    return false;
+
   rtx tmp_reg_src1 = gen_reg_rtx (word_mode);
   rtx tmp_reg_src2 = gen_reg_rtx (word_mode);
 
@@ -18736,13 +18744,18 @@ 
   while (bytes > 0)
     {
       int align = compute_current_alignment (base_align, offset);
-      load_mode = select_block_compare_mode(offset, bytes, align, word_mode_ok);
+      if (TARGET_EFFICIENT_OVERLAPPING_UNALIGNED)
+	load_mode = select_block_compare_mode(offset, bytes, align,
+					      word_mode_ok);
+      else
+	load_mode = select_block_compare_mode(0, bytes, align, word_mode_ok);
       load_mode_size = GET_MODE_SIZE (load_mode);
       if (bytes >= load_mode_size)
 	cmp_bytes = load_mode_size;
-      else
+      else if (TARGET_EFFICIENT_OVERLAPPING_UNALIGNED)
 	{
-	  /* Move this load back so it doesn't go past the end.  */
+	  /* Move this load back so it doesn't go past the end.
+	     P8/P9 can do this efficiently.  */
 	  int extra_bytes = load_mode_size - bytes;
 	  cmp_bytes = bytes;
 	  if (extra_bytes < offset)
@@ -18752,7 +18765,12 @@ 
 	      bytes = cmp_bytes;
 	    }
 	}
-
+      else
+	/* P7 and earlier can't do the overlapping load trick fast,
+	   so this forces a non-overlapping load and a shift to get
+	   rid of the extra bytes.  */
+	cmp_bytes = bytes;
+      
       src1 = adjust_address (orig_src1, load_mode, offset);
       src2 = adjust_address (orig_src2, load_mode, offset);
 
Index: gcc/config/rs6000/rs6000.h
===================================================================
--- gcc/config/rs6000/rs6000.h	(revision 240816)
+++ gcc/config/rs6000/rs6000.h	(working copy)
@@ -603,6 +603,9 @@ 
 				 && TARGET_POWERPC64)
 #define TARGET_VEXTRACTUB	(TARGET_P9_VECTOR && TARGET_DIRECT_MOVE \
 				 && TARGET_UPPER_REGS_DI && TARGET_POWERPC64)
+/* This wants to be set for p8 and newer.  On p7, overlapping unaligned
+   loads are slow. */
+#define TARGET_EFFICIENT_OVERLAPPING_UNALIGNED TARGET_EFFICIENT_UNALIGNED_VSX
 
 /* Byte/char syncs were added as phased in for ISA 2.06B, but are not present
    in power7, so conditionalize them on p8 features.  TImode syncs need quad