
[AArch64] Implement movmem for the benefit of inline memcpy

Message ID CA+=Sn1ma5kA6AZa_44DLiDyjVLH=rLvp0_TC0ck99PuGpmNK6A@mail.gmail.com
State New

Commit Message

Andrew Pinski Aug. 5, 2014, 7:05 a.m. UTC
On Fri, Aug 1, 2014 at 2:21 AM,  <pinskia@gmail.com> wrote:
>
>
>> On Jun 6, 2014, at 1:50 AM, James Greenhalgh <james.greenhalgh@arm.com> wrote:
>>
>>
>> Hi,
>>
>> The move_by_pieces infrastructure performs a copy by repeatedly trying
>> the largest safe copy it can make. So for a 15-byte copy we might see:
>>
>> offset   amount  bytes copied
>> 0        8       0-7
>> 8        4       8-11
>> 12       2       12-13
>> 14       1       14
>>
>> However, we can implement a 15-byte copy as so:
>>
>> offset   amount  bytes copied
>> 0        8       0-7
>> 7        8       7-14
>>
>> Which can prove more efficient for both space and speed.
>>
>> In this patch we set MOVE_RATIO low to avoid using move_by_pieces, and
>> implement the movmem pattern name to expand small block copy cases. Note, this
>> optimization does not apply for -mstrict-align targets, which must continue
>> copying byte-by-byte.
>
> Why not change move_by_pieces instead of having target-specific code? This seems like a better option. You can check the slow-unaligned-access target macro to see whether you want to do this optimization too.  As I mentioned in the other email, make sure you check whether the from and to are volatile before doing this optimization.

Attached is the patch which does what I mentioned; I also changed
store_by_pieces to implement a similar optimization there (for memset
and strcpy).  Also, since I used SLOW_UNALIGNED_ACCESS, this is a
generic optimization.
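
To make the leftover handling concrete, here is a rough C-level sketch of
what the overlapping tail amounts to for a 15-byte copy (copy15 is just an
illustrative name; this assumes unaligned 8-byte accesses are cheap, and the
patch of course does it at the RTL level rather than via memcpy):

#include <string.h>

/* Copy 15 bytes as two possibly-overlapping 8-byte accesses instead of
   8 + 4 + 2 + 1.  Byte 7 is read and written twice with the same value.  */
static void
copy15 (char *dst, const char *src)
{
  unsigned long long head, tail;
  memcpy (&head, src, 8);      /* bytes 0-7  */
  memcpy (&tail, src + 7, 8);  /* bytes 7-14 */
  memcpy (dst, &head, 8);
  memcpy (dst + 7, &tail, 8);
}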

I had tested an earlier version on x86_64-linux-gnu and I am in the
middle of bootstrap/testing on this one.

Thanks,
Andrew Pinski

* expr.c (move_by_pieces):
Take the min of max_size and len to speed things up
and to take advantage of the mode in move_by_pieces_1.
(move_by_pieces_1): Read/write the leftovers using overlapping
memory locations to reduce the number of reads/writes.
(store_by_pieces_1): Take the min of max_size and len to speed things up
and to take advantage of the mode in store_by_pieces_2.
(store_by_pieces_2): Write the leftovers using overlapping
memory locations to reduce the number of writes.


>
> Thanks,
> Andrew
>
>
>>
>> Setting MOVE_RATIO low in this way causes two tests to begin failing;
>> both are documented in the test-cases as expected to fail for
>> low MOVE_RATIO targets, which do not allow certain tree-level optimizations.
>>
>> Bootstrapped on aarch64-unknown-linux-gnu with no issues.
>>
>> OK for trunk?
>>
>> Thanks,
>> James
>>
>> ---
>> gcc/
>>
>> 2014-06-06  James Greenhalgh  <james.greenhalgh@arm.com>
>>
>>    * config/aarch64/aarch64-protos.h (aarch64_expand_movmem): New.
>>    * config/aarch64/aarch64.c (aarch64_move_pointer): New.
>>    (aarch64_progress_pointer): Likewise.
>>    (aarch64_copy_one_part_and_move_pointers): Likewise.
>>    (aarch64_expand_movmem): Likewise.
>>    * config/aarch64/aarch64.h (MOVE_RATIO): Set low.
>>    * config/aarch64/aarch64.md (movmem<mode>): New.
>>
>> gcc/testsuite/
>>
>> 2014-06-06  James Greenhalgh  <james.greenhalgh@arm.com>
>>
>>    * gcc.dg/tree-ssa/pr42585.c: Skip for AArch64.
>>    * gcc.dg/tree-ssa/sra-12.c: Likewise.
>> <0001-AArch64-Implement-movmem-for-the-benefit-of-inline-m.patch>

Comments

James Greenhalgh Aug. 7, 2014, 2:20 p.m. UTC | #1
On Tue, Aug 05, 2014 at 08:05:00AM +0100, Andrew Pinski wrote:
> On Fri, Aug 1, 2014 at 2:21 AM,  <pinskia@gmail.com> wrote:
> >> On Jun 6, 2014, at 1:50 AM, James Greenhalgh <james.greenhalgh@arm.com> wrote:
> >>
> >>
> >> Hi,
> >>
> >> The move_by_pieces infrastructure performs a copy by repeatedly trying
> >> the largest safe copy it can make. So for a 15-byte copy we might see:
> >>
> >> offset   amount  bytes copied
> >> 0        8       0-7
> >> 8        4       8-11
> >> 12       2       12-13
> >> 14       1       14
> >>
> >> However, we can implement a 15-byte copy as so:
> >>
> >> offset   amount  bytes copied
> >> 0        8       0-7
> >> 7        8       7-14
> >>
> >> Which can prove more efficient for both space and speed.
> >>
> >> In this patch we set MOVE_RATIO low to avoid using move_by_pieces, and
> >> implement the movmem pattern name to expand small block copy cases. Note, this
> >> optimization does not apply for -mstrict-align targets, which must continue
> >> copying byte-by-byte.
> >
> > Why not change move_by_pieces instead of having target-specific code?
> > This seems like a better option. You can check the slow-unaligned-access
> > target macro to see whether you want to do this optimization too.  As I
> > mentioned in the other email, make sure you check whether the from and to
> > are volatile before doing this optimization.

Hi Andrew,

If we are converting these volatile copies to memcpy calls, then there is an
additional bug there. There is nothing in the C standard which requires that
memcpy/memmove read and write each byte only once, and it seems reasonable to
assume that the movmem optab inherits this lack of restrictions.  This gotcha
either needs to be fixed, or at least documented, for the movmem optab.

If I'm going to fix this, I have to write it in the back-end. Failing
the expand will cause a buggy call to memcpy.  Having said that, I'm not
sure I've seen a good definition of the semantics of a volatile struct
copy. It feels to me that an element-by-element copy is more likely to
match user expectations than a chunk-by-chunk copy. It is probably too
late for me to get that by the time I am in movmem, so I'll have to push
the fix earlier (probably somewhere generic?).

Do you know of a good write-up/argument/discussion on volatile struct
copy semantics? The testcase you provided is obviously broken; what is less
obvious is what should happen for:

struct __attribute__((packed)) t16{
  long long t8;
  int t4;
  short t2;
  unsigned char t1;
  unsigned char t1a;
};
volatile struct t16 t16;
void f(struct t16 *a)
{
  t16 = *a;
}

We have at least two choices...
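
Written out as C, the two obvious lowerings look roughly like this
(illustrative only; the function names are made up and nothing here is
meant as generated code):

/* Choice 1: element-by-element -- one volatile access per field, with
   widths following the declared members (8, 4, 2, 1 and 1 bytes).  */
void f_elementwise (struct t16 *a)
{
  t16.t8  = a->t8;
  t16.t4  = a->t4;
  t16.t2  = a->t2;
  t16.t1  = a->t1;
  t16.t1a = a->t1a;
}

/* Choice 2: chunk-by-chunk -- e.g. two 8-byte accesses, so the second
   access covers t4, t2, t1 and t1a at once and no access width matches
   a field width, even though each byte is still touched exactly once
   (ignoring the aliasing and alignment niceties of the packed struct).  */
void f_chunkwise (struct t16 *a)
{
  unsigned long long tmp[2];
  __builtin_memcpy (tmp, a, sizeof tmp);              /* non-volatile source */
  ((volatile unsigned long long *) &t16)[0] = tmp[0]; /* bytes 0-7  */
  ((volatile unsigned long long *) &t16)[1] = tmp[1]; /* bytes 8-15 */
}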

> Attached is the patch which does what I mentioned; I also changed
> store_by_pieces to implement a similar optimization there (for memset
> and strcpy).  Also, since I used SLOW_UNALIGNED_ACCESS, this is a
> generic optimization.

I'm not sure this helps your situation on AArch64. There are still AArch64
implementations for which we will want to bypass move_by_pieces and provide
a back-end implementation.

We could more reasonably be controlling this with MOVE_BY_PIECES_P, but
this is only a thin wrapper around MOVE_RATIO, so the result for you is
much the same (pending a patch fixing SRA not to read MOVE_RATIO, it should
make no difference whether we disable by MOVE_RATIO or MOVE_BY_PIECES_P).

Have you done much micro/macro-benchmarking to show that this is indeed
a sensible optimization for !SLOW_UNALIGNED_ACCESS targets? The documentation
suggests that SLOW_UNALIGNED_ACCESS should be set if unaligned accesses
are "many times" slower. This is a bit of a blunt hammer - there are likely
targets which will suffer from this optimization, but which don't set
SLOW_UNALIGNED_ACCESS. Maybe you need some finer grained cost function?
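
For reference, the generic fallback (a rough paraphrase of the defaults.h
definition) answers only in terms of STRICT_ALIGNMENT, with no per-mode or
per-CPU cost information at all:

/* What a target gets if it does not define the macro itself.  */
#ifndef SLOW_UNALIGNED_ACCESS
#define SLOW_UNALIGNED_ACCESS(MODE, ALIGN) STRICT_ALIGNMENT
#endif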

Thanks,
James

> I had tested an earlier version on x86_64-linux-gnu and I am in the
> middle of bootstrap/testing on this one.
> 
> Thanks,
> Andrew Pinski
> 
> * expr.c (move_by_pieces):
> Take the min of max_size and len to speed things up
> and to take advantage of the mode in move_by_pieces_1.
> (move_by_pieces_1): Read/write the leftovers using overlapping
> memory locations to reduce the number of reads/writes.
> (store_by_pieces_1): Take the min of max_size and len to speed things up
> and to take advantage of the mode in store_by_pieces_2.
> (store_by_pieces_2): Write the leftovers using overlapping
>
Richard Biener Aug. 7, 2014, 2:34 p.m. UTC | #2
On Thu, Aug 7, 2014 at 4:20 PM, James Greenhalgh
<james.greenhalgh@arm.com> wrote:
> On Tue, Aug 05, 2014 at 08:05:00AM +0100, Andrew Pinski wrote:
>> On Fri, Aug 1, 2014 at 2:21 AM,  <pinskia@gmail.com> wrote:
>> >> On Jun 6, 2014, at 1:50 AM, James Greenhalgh <james.greenhalgh@arm.com> wrote:
>> >>
>> >>
>> >> Hi,
>> >>
>> >> The move_by_pieces infrastructure performs a copy by repeatedly trying
>> >> the largest safe copy it can make. So for a 15-byte copy we might see:
>> >>
>> >> offset   amount  bytes copied
>> >> 0        8       0-7
>> >> 8        4       8-11
>> >> 12       2       12-13
>> >> 14       1       14
>> >>
>> >> However, we can implement a 15-byte copy as so:
>> >>
>> >> offset   amount  bytes copied
>> >> 0        8       0-7
>> >> 7        8       7-14
>> >>
>> >> Which can prove more efficient for both space and speed.
>> >>
>> >> In this patch we set MOVE_RATIO low to avoid using move_by_pieces, and
>> >> implement the movmem pattern name to expand small block copy cases. Note, this
>> >> optimization does not apply for -mstrict-align targets, which must continue
>> >> copying byte-by-byte.
>> >
>> > Why not change move_by_pieces instead of having target-specific code?
>> > This seems like a better option. You can check the slow-unaligned-access
>> > target macro to see whether you want to do this optimization too.  As I
>> > mentioned in the other email, make sure you check whether the from and to
>> > are volatile before doing this optimization.
>
> Hi Andrew,
>
> If we are converting these volatile copies to memcpy calls, then there is an
> additional bug there. There is nothing in the C standard which requires that
> memcpy/memmove read and write each byte only once, and it seems reasonable to
> assume that the movmem optab inherits this lack of restrictions.  This gotcha
> either needs to be fixed, or at least documented, for the movmem optab.
>
> If I'm going to fix this, I have to write it in the back-end. Failing
> the expand will cause a buggy call to memcpy.  Having said that, I'm not
> sure I've seen a good definition of the semantics of a volatile struct
> copy. It feels to me that an element-by-element copy is more likely to
> match user expectations than a chunk-by-chunk copy. It is probably too
> late for me to get that by the time I am in movmem, so I'll have to push
> the fix earlier (probably somewhere generic?).
>
> Do you know of a good write-up/argument/discussion on volatile struct
> copy semantics? The testcase you provided is obviously broken; what is less
> obvious is what should happen for:
>
> struct __attribute__((packed)) t16{
>   long long t8;
>   int t4;
>   short t2;
>   unsigned char t1;
>   unsigned char t1a;
> };
> volatile struct t16 t16;
> void f(struct t16 *a)
> {
>   t16 = *a;
> }
>
> We have at least two choices...

It's the language frontend's job to present the middle-end with
something sensible, for example an element-wise copy.

Also consider

struct X { int i; volatile int j; int k[1024]; } x;
void f (struct X *a)
{
  x = *a;
}
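
A member-wise sketch of what a "sensible" lowering might look like here
(illustrative only, nothing dictates this exact shape):

void f_lowered (struct X *a)
{
  x.i = a->i;
  x.j = a->j;  /* the volatile member gets one access of its own width */
  __builtin_memcpy (x.k, a->k, sizeof x.k);  /* the rest can go in bulk */
}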

Richard.

>> Attached is the patch which does what I mentioned; I also changed
>> store_by_pieces to implement a similar optimization there (for memset
>> and strcpy).  Also, since I used SLOW_UNALIGNED_ACCESS, this is a
>> generic optimization.
>
> I'm not sure this helps your situation on AArch64. There are still AArch64
> implementations for which we will want to bypass move_by_pieces and provide
> a back-end implementation.
>
> We could more reasonably be controlling this with MOVE_BY_PIECES_P, but
> this is only a thin wrapper around MOVE_RATIO, so the result for you is
> much the same (pending a patch fixing SRA not to read MOVE_RATIO, it should
> make no difference whether we disable by MOVE_RATIO or MOVE_BY_PIECES_P).
>
> Have you done much micro/macro-benchmarking to show that this is indeed
> a sensible optimization for !SLOW_UNALIGNED_ACCESS targets? The documentation
> suggests that SLOW_UNALIGNED_ACCESS should be set if unaligned accesses
> are "many times" slower. This is a bit of a blunt hammer - there are likely
> targets which will suffer from this optimization, but which don't set
> SLOW_UNALIGNED_ACCESS. Maybe you need some finer grained cost function?
>
> Thanks,
> James
>
>> I had tested an earlier version on x86_64-linux-gnu and I am in the
>> middle of bootstrap/testing on this one.
>>
>> Thanks,
>> Andrew Pinski
>>
>> * expr.c (move_by_pieces):
>> Take the min of max_size and len to speed things up
>> and to take advantage of the mode in move_by_pieces_1.
>> (move_by_pieces_1): Read/write the leftovers using overlapping
>> memory locations to reduce the number of reads/writes.
>> (store_by_pieces_1): Take the min of max_size and len to speed things up
>> and to take advantage of the mode in store_by_pieces_2.
>> (store_by_pieces_2): Write the leftovers using overlapping
>>

Patch

Index: expr.c
===================================================================
--- expr.c	(revision 213306)
+++ expr.c	(working copy)
@@ -876,6 +876,9 @@  move_by_pieces (rtx to, rtx from, unsign
   if (data.reverse) data.offset = len;
   data.len = len;
 
+  /* Use the MIN of the length and the max size we can use. */
+  max_size = max_size > (len + 1) ? (len + 1) : max_size;
+
   /* If copying requires more than two move insns,
      copy addresses to registers (to make displacements shorter)
      and use post-increment if available.  */
@@ -1073,6 +1076,32 @@  move_by_pieces_1 (insn_gen_fn genfun, ma
 
       data->len -= size;
     }
+
+  /* If we have some data left and unaligned accesses
+     are not slow, back up slightly and emit the move. */
+  if (data->len > 0
+      && !STRICT_ALIGNMENT
+      && !SLOW_UNALIGNED_ACCESS (mode, 1)
+      /* Not a stack push */
+      && data->to
+      /* Neither side is volatile memory. */
+      && !MEM_VOLATILE_P (data->to)
+      && !MEM_VOLATILE_P (data->from)
+      && ceil_log2 (data->len) == exact_log2 (size)
+      /* No incrementing of the to or from. */
+      && data->explicit_inc_to == 0
+      && data->explicit_inc_from == 0
+      /* No auto-incrementing of the to or from. */
+      && !data->autinc_to
+      && !data->autinc_from
+      && !data->reverse)
+    {
+      unsigned offset = data->offset - (size - data->len);
+      to1 = adjust_address (data->to, mode, offset);
+      from1 = adjust_address (data->from, mode, offset);
+      emit_insn ((*genfun) (to1, from1));
+      data->len = 0;
+    }
 }
 
 /* Emit code to move a block Y to a block X.  This may be done with
@@ -2636,6 +2665,9 @@  store_by_pieces_1 (struct store_by_piece
   if (data->reverse)
     data->offset = data->len;
 
+  /* Use the MIN of the length and the max size we can use. */
+  max_size = max_size > (data->len + 1) ? (data->len + 1)  : max_size;
+
   /* If storing requires more than two move insns,
      copy addresses to registers (to make displacements shorter)
      and use post-increment if available.  */
@@ -2733,6 +2765,24 @@  store_by_pieces_2 (insn_gen_fn genfun, m
 
       data->len -= size;
     }
+
+  /* If we have some data left and unaligned accesses
+     are not slow, back up slightly and emit that constant.  */
+  if (data->len > 0
+      && !STRICT_ALIGNMENT
+      && !SLOW_UNALIGNED_ACCESS (mode, 1)
+      && !MEM_VOLATILE_P (data->to)
+      && ceil_log2 (data->len) == exact_log2 (size)
+      && data->explicit_inc_to == 0
+      && !data->autinc_to
+      && !data->reverse)
+    {
+      unsigned offset = data->offset - (size - data->len);
+      to1 = adjust_address (data->to, mode, offset);
+      cst = (*data->constfun) (data->constfundata, offset, mode);
+      emit_insn ((*genfun) (to1, cst));
+      data->len = 0;
+    }
 }
 
 /* Write zeros through the storage of OBJECT.  If OBJECT has BLKmode, SIZE is