diff mbox series

Prefer mempcpy to memcpy on x86_64 target (PR middle-end/81657, take 2)

Message ID 20180412173722.GP8577@tucnak
State New
Headers show
Series Prefer mempcpy to memcpy on x86_64 target (PR middle-end/81657, take 2) | expand

Commit Message

Jakub Jelinek April 12, 2018, 5:37 p.m. UTC
On Thu, Apr 12, 2018 at 05:29:35PM +0000, Wilco Dijkstra wrote:
> > Depending on what you mean old, I see e.g. in 2010 power7 mempcpy got added,
> > in 2013 other power versions, in 2016 s390*, etc.  Doing a decent mempcpy
> > isn't hard if you have asm version of memcpy and one spare register.
> 
> More mempcpy implementations have been added in recent years indeed, but almost all
> add an extra copy of the memcpy code rather than using a single combined implementation.
> That means it is still better to call memcpy (which is frequently used and thus likely in L1/L2)
> rather than mempcpy (which is more likely to be cold and thus not cached).

That really depends, usually when some app uses mempcpy, it uses it very
heavily.  And all the proposed patches do is honor what the user asked, if
you use memcpy () + n, we aren't transforming that into mempcpy behind the
user's back.

Anyway, here is what I think Richard was asking for, that I'm currently
bootstrapping/regtesting.  It can be easily combined with Martin's target
hook if needed, or do it only for
endp == 1 && target != const0_rtx && CALL_EXPR_TAILCALL (exp)
etc.

2018-04-12  Martin Liska  <mliska@suse.cz>
	    Jakub Jelinek  <jakub@redhat.com>

	PR middle-end/81657
	* expr.h (enum block_op_methods): Add BLOCK_OP_NO_LIBCALL_RET.
	* expr.c (emit_block_move_hints): Handle BLOCK_OP_NO_LIBCALL_RET.
	* builtins.c (expand_builtin_memory_copy_args): Use
	BLOCK_OP_NO_LIBCALL_RET method for mempcpy with non-ignored target,
	handle dest_addr == pc_rtx.

	* gcc.dg/string-opt-1.c: Remove bogus comment.  Expect a mempcpy
	call.



	Jakub

Comments

Jakub Jelinek April 12, 2018, 10:35 p.m. UTC | #1
On Thu, Apr 12, 2018 at 07:37:22PM +0200, Jakub Jelinek wrote:
> On Thu, Apr 12, 2018 at 05:29:35PM +0000, Wilco Dijkstra wrote:
> > > Depending on what you mean old, I see e.g. in 2010 power7 mempcpy got added,
> > > in 2013 other power versions, in 2016 s390*, etc.  Doing a decent mempcpy
> > > isn't hard if you have asm version of memcpy and one spare register.
> > 
> > More mempcpy implementations have been added in recent years indeed, but almost all
> > add an extra copy of the memcpy code rather than using a single combined implementation.
> > That means it is still better to call memcpy (which is frequently used and thus likely in L1/L2)
> > rather than mempcpy (which is more likely to be cold and thus not cached).
> 
> That really depends, usually when some app uses mempcpy, it uses it very
> heavily.  And all the proposed patches do is honor what the user asked, if
> you use memcpy () + n, we aren't transforming that into mempcpy behind the
> user's back.
> 
> Anyway, here is what I think Richard was asking for, that I'm currently
> bootstrapping/regtesting.  It can be easily combined with Martin's target
> hook if needed, or do it only for
> endp == 1 && target != const0_rtx && CALL_EXPR_TAILCALL (exp)
> etc.
> 
> 2018-04-12  Martin Liska  <mliska@suse.cz>
> 	    Jakub Jelinek  <jakub@redhat.com>
> 
> 	PR middle-end/81657
> 	* expr.h (enum block_op_methods): Add BLOCK_OP_NO_LIBCALL_RET.
> 	* expr.c (emit_block_move_hints): Handle BLOCK_OP_NO_LIBCALL_RET.
> 	* builtins.c (expand_builtin_memory_copy_args): Use
> 	BLOCK_OP_NO_LIBCALL_RET method for mempcpy with non-ignored target,
> 	handle dest_addr == pc_rtx.
> 
> 	* gcc.dg/string-opt-1.c: Remove bogus comment.  Expect a mempcpy
> 	call.

Successfully bootstrapped/regtested on x86_64-linux and i686-linux.

	Jakub
Richard Biener April 13, 2018, 8:31 a.m. UTC | #2
On April 13, 2018 12:35:54 AM GMT+02:00, Jakub Jelinek <jakub@redhat.com> wrote:
>On Thu, Apr 12, 2018 at 07:37:22PM +0200, Jakub Jelinek wrote:
>> On Thu, Apr 12, 2018 at 05:29:35PM +0000, Wilco Dijkstra wrote:
>> > > Depending on what you mean old, I see e.g. in 2010 power7 mempcpy
>got added,
>> > > in 2013 other power versions, in 2016 s390*, etc.  Doing a decent
>mempcpy
>> > > isn't hard if you have asm version of memcpy and one spare
>register.
>> > 
>> > More mempcpy implementations have been added in recent years
>indeed, but almost all
>> > add an extra copy of the memcpy code rather than using a single
>combined implementation.
>> > That means it is still better to call memcpy (which is frequently
>used and thus likely in L1/L2)
>> > rather than mempcpy (which is more likely to be cold and thus not
>cached).
>> 
>> That really depends, usually when some app uses mempcpy, it uses it
>very
>> heavily.  And all the proposed patches do is honor what the user
>asked, if
>> you use memcpy () + n, we aren't transforming that into mempcpy
>behind the
>> user's back.
>> 
>> Anyway, here is what I think Richard was asking for, that I'm
>currently
>> bootstrapping/regtesting.  It can be easily combined with Martin's
>target
>> hook if needed, or do it only for
>> endp == 1 && target != const0_rtx && CALL_EXPR_TAILCALL (exp)
>> etc.
>> 
>> 2018-04-12  Martin Liska  <mliska@suse.cz>
>> 	    Jakub Jelinek  <jakub@redhat.com>
>> 
>> 	PR middle-end/81657
>> 	* expr.h (enum block_op_methods): Add BLOCK_OP_NO_LIBCALL_RET.
>> 	* expr.c (emit_block_move_hints): Handle BLOCK_OP_NO_LIBCALL_RET.
>> 	* builtins.c (expand_builtin_memory_copy_args): Use
>> 	BLOCK_OP_NO_LIBCALL_RET method for mempcpy with non-ignored target,
>> 	handle dest_addr == pc_rtx.
>> 
>> 	* gcc.dg/string-opt-1.c: Remove bogus comment.  Expect a mempcpy
>> 	call.
>
>Successfully bootstrapped/regtested on x86_64-linux and i686-linux.

OK. 

Thanks, 
Richard. 

>	Jakub
Wilco Dijkstra April 13, 2018, 12:10 p.m. UTC | #3
Jakub Jelinek wrote:  
>On Thu, Apr 12, 2018 at 05:29:35PM +0000, Wilco Dijkstra wrote:
>> > Depending on what you mean old, I see e.g. in 2010 power7 mempcpy got added,
>> > in 2013 other power versions, in 2016 s390*, etc.  Doing a decent mempcpy
>> > isn't hard if you have asm version of memcpy and one spare register.
>> 
>> More mempcpy implementations have been added in recent years indeed, but almost all
>> add an extra copy of the memcpy code rather than using a single combined implementation.
>> That means it is still better to call memcpy (which is frequently used and thus likely in L1/L2)
>> rather than mempcpy (which is more likely to be cold and thus not cached).
>
> That really depends, usually when some app uses mempcpy, it uses it very
> heavily.

But it would have to not use memcpy nearby. Do you have examples of apps using
mempcpy significantly? GLIBC is the only case I've seen that uses mempcpy.

> And all the proposed patches do is honor what the user asked, if
> you use memcpy () + n, we aren't transforming that into mempcpy behind the
> user's back.

We transform a lot of calls behind the user's back so that's not a plausible argument
for "honoring" the original call. Eg. (void)mempcpy always gets changed to memcpy,
bcopy to memmove, bzero to memset, strchr (s, 0) into strlen(s) + s - the list is long
and there are plenty cases where these expansions block tailcalls.

> Anyway, here is what I think Richard was asking for, that I'm currently
> bootstrapping/regtesting.  It can be easily combined with Martin's target
> hook if needed, or do it only for
> endp == 1 && target != const0_rtx && CALL_EXPR_TAILCALL (exp)

This patch causes regressions on AArch64 since it now always calls mempcpy
again, so yes either it would need to be done only for tailcalls (which fixes the
reported regression) or we need Martin's change too so each target can state whether
they have a fast mempcpy or not.

Wilco
H.J. Lu April 13, 2018, 12:18 p.m. UTC | #4
On Fri, Apr 13, 2018 at 5:10 AM, Wilco Dijkstra <Wilco.Dijkstra@arm.com> wrote:

> This patch causes regressions on AArch64 since it now always calls mempcpy
> again, so yes either it would need to be done only for tailcalls (which fixes the
> reported regression) or we need Martin's change too so each target can state whether
> they have a fast mempcpy or not.
>

You should open a bug report to track it.
Jakub Jelinek April 13, 2018, 12:25 p.m. UTC | #5
On Fri, Apr 13, 2018 at 12:10:33PM +0000, Wilco Dijkstra wrote:
> > Anyway, here is what I think Richard was asking for, that I'm currently
> > bootstrapping/regtesting.  It can be easily combined with Martin's target
> > hook if needed, or do it only for
> > endp == 1 && target != const0_rtx && CALL_EXPR_TAILCALL (exp)
> 
> This patch causes regressions on AArch64 since it now always calls mempcpy
> again, so yes either it would need to be done only for tailcalls (which fixes the

No, it only uses mempcpy if we'd otherwise call memcpy library function and user
wrote mempcpy.  AFAIK 7.x and earlier behaved that way too, so it isn't a
regression, regression is only from released GCC versions.  And, a fix is
easy, just implement fast mempcpy on aarch64 ;)

	Jakub
diff mbox series

Patch

--- gcc/expr.h.jj	2018-01-12 11:35:51.424222835 +0100
+++ gcc/expr.h	2018-04-12 18:38:07.377464114 +0200
@@ -100,7 +100,11 @@  enum block_op_methods
   BLOCK_OP_NO_LIBCALL,
   BLOCK_OP_CALL_PARM,
   /* Like BLOCK_OP_NORMAL, but the libcall can be tail call optimized.  */
-  BLOCK_OP_TAILCALL
+  BLOCK_OP_TAILCALL,
+  /* Like BLOCK_OP_NO_LIBCALL, but instead of emitting a libcall return
+     pc_rtx to indicate nothing has been emitted and let the caller handle
+     it.  */
+  BLOCK_OP_NO_LIBCALL_RET
 };
 
 typedef rtx (*by_pieces_constfn) (void *, HOST_WIDE_INT, scalar_int_mode);
--- gcc/expr.c.jj	2018-04-06 19:19:14.954130838 +0200
+++ gcc/expr.c	2018-04-12 18:39:58.866536619 +0200
@@ -1565,7 +1565,7 @@  emit_block_move_hints (rtx x, rtx y, rtx
 		       unsigned HOST_WIDE_INT max_size,
 		       unsigned HOST_WIDE_INT probable_max_size)
 {
-  bool may_use_call;
+  int may_use_call;
   rtx retval = 0;
   unsigned int align;
 
@@ -1577,7 +1577,7 @@  emit_block_move_hints (rtx x, rtx y, rtx
     {
     case BLOCK_OP_NORMAL:
     case BLOCK_OP_TAILCALL:
-      may_use_call = true;
+      may_use_call = 1;
       break;
 
     case BLOCK_OP_CALL_PARM:
@@ -1589,7 +1589,11 @@  emit_block_move_hints (rtx x, rtx y, rtx
       break;
 
     case BLOCK_OP_NO_LIBCALL:
-      may_use_call = false;
+      may_use_call = 0;
+      break;
+
+    case BLOCK_OP_NO_LIBCALL_RET:
+      may_use_call = -1;
       break;
 
     default:
@@ -1625,6 +1629,9 @@  emit_block_move_hints (rtx x, rtx y, rtx
 	   && ADDR_SPACE_GENERIC_P (MEM_ADDR_SPACE (x))
 	   && ADDR_SPACE_GENERIC_P (MEM_ADDR_SPACE (y)))
     {
+      if (may_use_call < 0)
+	return pc_rtx;
+
       /* Since x and y are passed to a libcall, mark the corresponding
 	 tree EXPR as addressable.  */
       tree y_expr = MEM_EXPR (y);
--- gcc/builtins.c.jj	2018-04-12 13:35:34.328395156 +0200
+++ gcc/builtins.c	2018-04-12 18:42:01.846616598 +0200
@@ -3650,12 +3650,16 @@  expand_builtin_memory_copy_args (tree de
   set_mem_align (src_mem, src_align);
 
   /* Copy word part most expediently.  */
-  dest_addr = emit_block_move_hints (dest_mem, src_mem, len_rtx,
-				     CALL_EXPR_TAILCALL (exp)
-				     && (endp == 0 || target == const0_rtx)
-				     ? BLOCK_OP_TAILCALL : BLOCK_OP_NORMAL,
+  enum block_op_methods method = BLOCK_OP_NORMAL;
+  if (CALL_EXPR_TAILCALL (exp) && (endp == 0 || target == const0_rtx))
+    method = BLOCK_OP_TAILCALL;
+  if (endp == 1 && target != const0_rtx)
+    method = BLOCK_OP_NO_LIBCALL_RET;
+  dest_addr = emit_block_move_hints (dest_mem, src_mem, len_rtx, method,
 				     expected_align, expected_size,
 				     min_size, max_size, probable_max_size);
+  if (dest_addr == pc_rtx)
+    return NULL_RTX;
 
   if (dest_addr == 0)
     {
--- gcc/testsuite/gcc.dg/string-opt-1.c.jj	2017-08-01 19:23:09.923512205 +0200
+++ gcc/testsuite/gcc.dg/string-opt-1.c	2018-04-12 18:57:10.940217129 +0200
@@ -1,4 +1,3 @@ 
-/* Ensure mempcpy is "optimized" into memcpy followed by addition.  */
 /* { dg-do compile } */
 /* { dg-options "-O2" } */
 
@@ -48,5 +47,5 @@  main (void)
   return 0;
 }
 
-/* { dg-final { scan-assembler-not "\<mempcpy\>" } } */
+/* { dg-final { scan-assembler "mempcpy" } } */
 /* { dg-final { scan-assembler "memcpy" } } */