diff mbox

[tentative] Use MOVE_MAX_PIECES instead of MOVE_MAX in gimple_fold_builtin_memory_op

Message ID 871sy180ie.fsf@atmel.com
State New
Headers show

Commit Message

Senthil Kumar Selvaraj Nov. 24, 2016, 1:34 p.m. UTC
Hi,

  I've been analyzing a failing regtest (gcc.dg/strlenopt-8.c) for the avr
  target. I found that the (dump) failure is because there are 4
  instances of memcpy, while the testcase expects only 2 for a
  non-strict align target like the avr.

  Comparing that with a dump generated by x64_64-pc-linux, I found that
  the extra memcpy's come from the forwprop pass, when it replaces
  strcat with strlen and memcpy. For x86_64, the memcpy generated gets
  folded into a load/store in gimple_fold_builtin_memory_op. That
  doesn't happen for the avr because len (2) happens to be bigger than
  MOVE_MAX (1).

  The avr can only move 1 byte efficiently from reg <-> memory, but it's
  more efficient to load and store 2 bytes than to call memcpy, so
  MOVE_MAX_PIECES is set to 2.

  Given that gimple_fold_builtin_memory_op gets to choose between
  leaving the memcpy call as is, or breaking it down to a by-pieces
  move, shouldn't it use MOVE_MAX_PIECES instead of
  MOV_MAX?

  That is what the below patch does, and that makes the test
  pass. Does this sound right?

Regards
Senthil

Comments

Richard Biener Nov. 24, 2016, 1:57 p.m. UTC | #1
On Thu, 24 Nov 2016, Senthil Kumar Selvaraj wrote:

> Hi,
> 
>   I've been analyzing a failing regtest (gcc.dg/strlenopt-8.c) for the avr
>   target. I found that the (dump) failure is because there are 4
>   instances of memcpy, while the testcase expects only 2 for a
>   non-strict align target like the avr.
> 
>   Comparing that with a dump generated by x64_64-pc-linux, I found that
>   the extra memcpy's come from the forwprop pass, when it replaces
>   strcat with strlen and memcpy. For x86_64, the memcpy generated gets
>   folded into a load/store in gimple_fold_builtin_memory_op. That
>   doesn't happen for the avr because len (2) happens to be bigger than
>   MOVE_MAX (1).
> 
>   The avr can only move 1 byte efficiently from reg <-> memory, but it's
>   more efficient to load and store 2 bytes than to call memcpy, so
>   MOVE_MAX_PIECES is set to 2.
> 
>   Given that gimple_fold_builtin_memory_op gets to choose between
>   leaving the memcpy call as is, or breaking it down to a by-pieces
>   move, shouldn't it use MOVE_MAX_PIECES instead of
>   MOV_MAX?
> 
>   That is what the below patch does, and that makes the test
>   pass. Does this sound right?

No, as we handle both memcpy and memmove this way we rely on
the whole storage fit in a single register so we do the
right thing for overlapping memory.

Richard.

> Regards
> Senthil
> 
> Index: gcc/gimple-fold.c
> ===================================================================
> --- gcc/gimple-fold.c	(revision 242741)
> +++ gcc/gimple-fold.c	(working copy)
> @@ -703,7 +703,7 @@
>        src_align = get_pointer_alignment (src);
>        dest_align = get_pointer_alignment (dest);
>        if (tree_fits_uhwi_p (len)
> -	  && compare_tree_int (len, MOVE_MAX) <= 0
> +	  && compare_tree_int (len, MOVE_MAX_PIECES) <= 0
>  	  /* ???  Don't transform copies from strings with known length this
>  	     confuses the tree-ssa-strlen.c.  This doesn't handle
>  	     the case in gcc.dg/strlenopt-8.c which is XFAILed for that
> 
>
Richard Biener Nov. 24, 2016, 3:09 p.m. UTC | #2
On Thu, 24 Nov 2016, Richard Biener wrote:

> On Thu, 24 Nov 2016, Senthil Kumar Selvaraj wrote:
> 
> > Hi,
> > 
> >   I've been analyzing a failing regtest (gcc.dg/strlenopt-8.c) for the avr
> >   target. I found that the (dump) failure is because there are 4
> >   instances of memcpy, while the testcase expects only 2 for a
> >   non-strict align target like the avr.
> > 
> >   Comparing that with a dump generated by x64_64-pc-linux, I found that
> >   the extra memcpy's come from the forwprop pass, when it replaces
> >   strcat with strlen and memcpy. For x86_64, the memcpy generated gets
> >   folded into a load/store in gimple_fold_builtin_memory_op. That
> >   doesn't happen for the avr because len (2) happens to be bigger than
> >   MOVE_MAX (1).
> > 
> >   The avr can only move 1 byte efficiently from reg <-> memory, but it's
> >   more efficient to load and store 2 bytes than to call memcpy, so
> >   MOVE_MAX_PIECES is set to 2.
> > 
> >   Given that gimple_fold_builtin_memory_op gets to choose between
> >   leaving the memcpy call as is, or breaking it down to a by-pieces
> >   move, shouldn't it use MOVE_MAX_PIECES instead of
> >   MOV_MAX?
> > 
> >   That is what the below patch does, and that makes the test
> >   pass. Does this sound right?
> 
> No, as we handle both memcpy and memmove this way we rely on
> the whole storage fit in a single register so we do the
> right thing for overlapping memory.

So actually your patch doesn't chnage that, the ordering is ensured
by emitting a single GIMPLE load/store pair.  There are only
four targets defining MOVE_MAX_PIECES, and one (s390) even has
a smaller MOVE_MAX_PIECES than MOVE_MAX (huh).  AVR has larger
MOVE_MAX_PIECES than MOVE_MAX, but that seems to not make much
sense to me given their very similar description plus the
fact that AVR can only load a single byte at a time...

The x86 comment says

/* MOVE_MAX_PIECES is the number of bytes at a time which we can
   move efficiently, as opposed to  MOVE_MAX which is the maximum
   number of bytes we can move with a single instruction.

which doesn't match up with

@defmac MOVE_MAX
The maximum number of bytes that a single instruction can move quickly
between memory and registers or between two memory locations.
@end defmac

note "quickly" here.  But OTOH

@defmac MOVE_MAX_PIECES
A C expression used by @code{move_by_pieces} to determine the largest unit
a load or store used to copy memory is.  Defaults to @code{MOVE_MAX}.
@end defmac

here the only difference is "copy memory".  But we try to special
case the one load - one store case, not generally "copy memory".

So I think MOVE_MAX matches my intent when writing the code.

Richard.

> Richard.
> 
> > Regards
> > Senthil
> > 
> > Index: gcc/gimple-fold.c
> > ===================================================================
> > --- gcc/gimple-fold.c	(revision 242741)
> > +++ gcc/gimple-fold.c	(working copy)
> > @@ -703,7 +703,7 @@
> >        src_align = get_pointer_alignment (src);
> >        dest_align = get_pointer_alignment (dest);
> >        if (tree_fits_uhwi_p (len)
> > -	  && compare_tree_int (len, MOVE_MAX) <= 0
> > +	  && compare_tree_int (len, MOVE_MAX_PIECES) <= 0
> >  	  /* ???  Don't transform copies from strings with known length this
> >  	     confuses the tree-ssa-strlen.c.  This doesn't handle
> >  	     the case in gcc.dg/strlenopt-8.c which is XFAILed for that
> > 
> > 
> 
>
Senthil Kumar Selvaraj Nov. 24, 2016, 5:43 p.m. UTC | #3
Richard Biener writes:

> On Thu, 24 Nov 2016, Richard Biener wrote:
>
>> On Thu, 24 Nov 2016, Senthil Kumar Selvaraj wrote:
>> 
>> > Hi,
>> > 
>> >   I've been analyzing a failing regtest (gcc.dg/strlenopt-8.c) for the avr
>> >   target. I found that the (dump) failure is because there are 4
>> >   instances of memcpy, while the testcase expects only 2 for a
>> >   non-strict align target like the avr.
>> > 
>> >   Comparing that with a dump generated by x64_64-pc-linux, I found that
>> >   the extra memcpy's come from the forwprop pass, when it replaces
>> >   strcat with strlen and memcpy. For x86_64, the memcpy generated gets
>> >   folded into a load/store in gimple_fold_builtin_memory_op. That
>> >   doesn't happen for the avr because len (2) happens to be bigger than
>> >   MOVE_MAX (1).
>> > 
>> >   The avr can only move 1 byte efficiently from reg <-> memory, but it's
>> >   more efficient to load and store 2 bytes than to call memcpy, so
>> >   MOVE_MAX_PIECES is set to 2.
>> > 
>> >   Given that gimple_fold_builtin_memory_op gets to choose between
>> >   leaving the memcpy call as is, or breaking it down to a by-pieces
>> >   move, shouldn't it use MOVE_MAX_PIECES instead of
>> >   MOV_MAX?
>> > 
>> >   That is what the below patch does, and that makes the test
>> >   pass. Does this sound right?
>> 
>> No, as we handle both memcpy and memmove this way we rely on
>> the whole storage fit in a single register so we do the
>> right thing for overlapping memory.
>
> So actually your patch doesn't chnage that, the ordering is ensured
> by emitting a single GIMPLE load/store pair.  There are only
> four targets defining MOVE_MAX_PIECES, and one (s390) even has
> a smaller MOVE_MAX_PIECES than MOVE_MAX (huh).  AVR has larger
> MOVE_MAX_PIECES than MOVE_MAX, but that seems to not make much
> sense to me given their very similar description plus the
> fact that AVR can only load a single byte at a time...
>
> The x86 comment says
>
> /* MOVE_MAX_PIECES is the number of bytes at a time which we can
>    move efficiently, as opposed to  MOVE_MAX which is the maximum
>    number of bytes we can move with a single instruction.
>
> which doesn't match up with
>
> @defmac MOVE_MAX
> The maximum number of bytes that a single instruction can move quickly
> between memory and registers or between two memory locations.
> @end defmac
>
> note "quickly" here.  But OTOH
>
> @defmac MOVE_MAX_PIECES
> A C expression used by @code{move_by_pieces} to determine the largest unit
> a load or store used to copy memory is.  Defaults to @code{MOVE_MAX}.
> @end defmac
>
> here the only difference is "copy memory".  But we try to special
> case the one load - one store case, not generally "copy memory".
>
> So I think MOVE_MAX matches my intent when writing the code.

Ok, but isn't that inconsistent with tree-inline.c:estimate_move_cost, which
considers MOVE_MAX_PIECES and MOVE_RATIO to decide between a libcall and
by-pieces move?

Regards
Senthil

>
> Richard.
>
>> Richard.
>> 
>> > Regards
>> > Senthil
>> > 
>> > Index: gcc/gimple-fold.c
>> > ===================================================================
>> > --- gcc/gimple-fold.c	(revision 242741)
>> > +++ gcc/gimple-fold.c	(working copy)
>> > @@ -703,7 +703,7 @@
>> >        src_align = get_pointer_alignment (src);
>> >        dest_align = get_pointer_alignment (dest);
>> >        if (tree_fits_uhwi_p (len)
>> > -	  && compare_tree_int (len, MOVE_MAX) <= 0
>> > +	  && compare_tree_int (len, MOVE_MAX_PIECES) <= 0
>> >  	  /* ???  Don't transform copies from strings with known length this
>> >  	     confuses the tree-ssa-strlen.c.  This doesn't handle
>> >  	     the case in gcc.dg/strlenopt-8.c which is XFAILed for that
>> > 
>> > 
>> 
>>
Richard Biener Nov. 25, 2016, 8:04 a.m. UTC | #4
On Thu, 24 Nov 2016, Senthil Kumar Selvaraj wrote:

> 
> Richard Biener writes:
> 
> > On Thu, 24 Nov 2016, Richard Biener wrote:
> >
> >> On Thu, 24 Nov 2016, Senthil Kumar Selvaraj wrote:
> >> 
> >> > Hi,
> >> > 
> >> >   I've been analyzing a failing regtest (gcc.dg/strlenopt-8.c) for the avr
> >> >   target. I found that the (dump) failure is because there are 4
> >> >   instances of memcpy, while the testcase expects only 2 for a
> >> >   non-strict align target like the avr.
> >> > 
> >> >   Comparing that with a dump generated by x64_64-pc-linux, I found that
> >> >   the extra memcpy's come from the forwprop pass, when it replaces
> >> >   strcat with strlen and memcpy. For x86_64, the memcpy generated gets
> >> >   folded into a load/store in gimple_fold_builtin_memory_op. That
> >> >   doesn't happen for the avr because len (2) happens to be bigger than
> >> >   MOVE_MAX (1).
> >> > 
> >> >   The avr can only move 1 byte efficiently from reg <-> memory, but it's
> >> >   more efficient to load and store 2 bytes than to call memcpy, so
> >> >   MOVE_MAX_PIECES is set to 2.
> >> > 
> >> >   Given that gimple_fold_builtin_memory_op gets to choose between
> >> >   leaving the memcpy call as is, or breaking it down to a by-pieces
> >> >   move, shouldn't it use MOVE_MAX_PIECES instead of
> >> >   MOV_MAX?
> >> > 
> >> >   That is what the below patch does, and that makes the test
> >> >   pass. Does this sound right?
> >> 
> >> No, as we handle both memcpy and memmove this way we rely on
> >> the whole storage fit in a single register so we do the
> >> right thing for overlapping memory.
> >
> > So actually your patch doesn't chnage that, the ordering is ensured
> > by emitting a single GIMPLE load/store pair.  There are only
> > four targets defining MOVE_MAX_PIECES, and one (s390) even has
> > a smaller MOVE_MAX_PIECES than MOVE_MAX (huh).  AVR has larger
> > MOVE_MAX_PIECES than MOVE_MAX, but that seems to not make much
> > sense to me given their very similar description plus the
> > fact that AVR can only load a single byte at a time...
> >
> > The x86 comment says
> >
> > /* MOVE_MAX_PIECES is the number of bytes at a time which we can
> >    move efficiently, as opposed to  MOVE_MAX which is the maximum
> >    number of bytes we can move with a single instruction.
> >
> > which doesn't match up with
> >
> > @defmac MOVE_MAX
> > The maximum number of bytes that a single instruction can move quickly
> > between memory and registers or between two memory locations.
> > @end defmac
> >
> > note "quickly" here.  But OTOH
> >
> > @defmac MOVE_MAX_PIECES
> > A C expression used by @code{move_by_pieces} to determine the largest unit
> > a load or store used to copy memory is.  Defaults to @code{MOVE_MAX}.
> > @end defmac
> >
> > here the only difference is "copy memory".  But we try to special
> > case the one load - one store case, not generally "copy memory".
> >
> > So I think MOVE_MAX matches my intent when writing the code.
> 
> Ok, but isn't that inconsistent with tree-inline.c:estimate_move_cost, which
> considers MOVE_MAX_PIECES and MOVE_RATIO to decide between a libcall and
> by-pieces move?

Well, I don't understand why we have both MOVE_MAX and MOVE_MAX_PIECES.
There are exactly two uses of MOVE_MAX in GCC AFAICS, the gimple-fold.c
one and caller-save.c which derives MOVE_MAX_WORDS from it.  
MOVE_MAX_PIECES has the only use in block move expansion plus the
single use in tree-inline.c.

So I can't give a reason why one or the other should be more valid
but the tree-inline.c one tries to match memcpy expansion (obviously),
while the gimple-fold.c one tries to get at the maximum possible
single-insn move amount (and AVR is odd here having that lower than
MOVE_MAX_PIECES, compared to say s390 which has it the opposite way
around).

Richard.

> Regards
> Senthil
> 
> >
> > Richard.
> >
> >> Richard.
> >> 
> >> > Regards
> >> > Senthil
> >> > 
> >> > Index: gcc/gimple-fold.c
> >> > ===================================================================
> >> > --- gcc/gimple-fold.c	(revision 242741)
> >> > +++ gcc/gimple-fold.c	(working copy)
> >> > @@ -703,7 +703,7 @@
> >> >        src_align = get_pointer_alignment (src);
> >> >        dest_align = get_pointer_alignment (dest);
> >> >        if (tree_fits_uhwi_p (len)
> >> > -	  && compare_tree_int (len, MOVE_MAX) <= 0
> >> > +	  && compare_tree_int (len, MOVE_MAX_PIECES) <= 0
> >> >  	  /* ???  Don't transform copies from strings with known length this
> >> >  	     confuses the tree-ssa-strlen.c.  This doesn't handle
> >> >  	     the case in gcc.dg/strlenopt-8.c which is XFAILed for that
> >> > 
> >> > 
> >> 
> >> 
> 
>
Jeff Law Nov. 28, 2016, 9:58 p.m. UTC | #5
On 11/25/2016 01:04 AM, Richard Biener wrote:
> On Thu, 24 Nov 2016, Senthil Kumar Selvaraj wrote:
>
>>
>> Richard Biener writes:
>>
>>> On Thu, 24 Nov 2016, Richard Biener wrote:
>>>
>>>> On Thu, 24 Nov 2016, Senthil Kumar Selvaraj wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>>   I've been analyzing a failing regtest (gcc.dg/strlenopt-8.c) for the avr
>>>>>   target. I found that the (dump) failure is because there are 4
>>>>>   instances of memcpy, while the testcase expects only 2 for a
>>>>>   non-strict align target like the avr.
>>>>>
>>>>>   Comparing that with a dump generated by x64_64-pc-linux, I found that
>>>>>   the extra memcpy's come from the forwprop pass, when it replaces
>>>>>   strcat with strlen and memcpy. For x86_64, the memcpy generated gets
>>>>>   folded into a load/store in gimple_fold_builtin_memory_op. That
>>>>>   doesn't happen for the avr because len (2) happens to be bigger than
>>>>>   MOVE_MAX (1).
>>>>>
>>>>>   The avr can only move 1 byte efficiently from reg <-> memory, but it's
>>>>>   more efficient to load and store 2 bytes than to call memcpy, so
>>>>>   MOVE_MAX_PIECES is set to 2.
>>>>>
>>>>>   Given that gimple_fold_builtin_memory_op gets to choose between
>>>>>   leaving the memcpy call as is, or breaking it down to a by-pieces
>>>>>   move, shouldn't it use MOVE_MAX_PIECES instead of
>>>>>   MOV_MAX?
>>>>>
>>>>>   That is what the below patch does, and that makes the test
>>>>>   pass. Does this sound right?
>>>>
>>>> No, as we handle both memcpy and memmove this way we rely on
>>>> the whole storage fit in a single register so we do the
>>>> right thing for overlapping memory.
>>>
>>> So actually your patch doesn't chnage that, the ordering is ensured
>>> by emitting a single GIMPLE load/store pair.  There are only
>>> four targets defining MOVE_MAX_PIECES, and one (s390) even has
>>> a smaller MOVE_MAX_PIECES than MOVE_MAX (huh).  AVR has larger
>>> MOVE_MAX_PIECES than MOVE_MAX, but that seems to not make much
>>> sense to me given their very similar description plus the
>>> fact that AVR can only load a single byte at a time...
>>>
>>> The x86 comment says
>>>
>>> /* MOVE_MAX_PIECES is the number of bytes at a time which we can
>>>    move efficiently, as opposed to  MOVE_MAX which is the maximum
>>>    number of bytes we can move with a single instruction.
>>>
>>> which doesn't match up with
>>>
>>> @defmac MOVE_MAX
>>> The maximum number of bytes that a single instruction can move quickly
>>> between memory and registers or between two memory locations.
>>> @end defmac
>>>
>>> note "quickly" here.  But OTOH
>>>
>>> @defmac MOVE_MAX_PIECES
>>> A C expression used by @code{move_by_pieces} to determine the largest unit
>>> a load or store used to copy memory is.  Defaults to @code{MOVE_MAX}.
>>> @end defmac
>>>
>>> here the only difference is "copy memory".  But we try to special
>>> case the one load - one store case, not generally "copy memory".
>>>
>>> So I think MOVE_MAX matches my intent when writing the code.
>>
>> Ok, but isn't that inconsistent with tree-inline.c:estimate_move_cost, which
>> considers MOVE_MAX_PIECES and MOVE_RATIO to decide between a libcall and
>> by-pieces move?
>
> Well, I don't understand why we have both MOVE_MAX and MOVE_MAX_PIECES.
> There are exactly two uses of MOVE_MAX in GCC AFAICS, the gimple-fold.c
> one and caller-save.c which derives MOVE_MAX_WORDS from it.
> MOVE_MAX_PIECES has the only use in block move expansion plus the
> single use in tree-inline.c.
>
> So I can't give a reason why one or the other should be more valid
> but the tree-inline.c one tries to match memcpy expansion (obviously),
> while the gimple-fold.c one tries to get at the maximum possible
> single-insn move amount (and AVR is odd here having that lower than
> MOVE_MAX_PIECES, compared to say s390 which has it the opposite way
> around).
It's probably historical.  MOVE_MAX_PIECES, IIRC, was primarily to guide 
whether or not to emit a structure copy inline vs call memcpy.

MOVE_MAX, IIRC, was supposed to control how many consecutive registers 
we'd try to squash into a single save/restore during caller-saves.   ie, 
it might be profitable to squash a pair of 32 bit saves/restores into a 
single 64 bit save/restore.  But it may not have been profitable to 
squash a pair of 64bit saves/restores into a 128bit save/restore.



There may have been other uses of each that have been removed over time.

jeff
diff mbox

Patch

Index: gcc/gimple-fold.c
===================================================================
--- gcc/gimple-fold.c	(revision 242741)
+++ gcc/gimple-fold.c	(working copy)
@@ -703,7 +703,7 @@ 
       src_align = get_pointer_alignment (src);
       dest_align = get_pointer_alignment (dest);
       if (tree_fits_uhwi_p (len)
-	  && compare_tree_int (len, MOVE_MAX) <= 0
+	  && compare_tree_int (len, MOVE_MAX_PIECES) <= 0
 	  /* ???  Don't transform copies from strings with known length this
 	     confuses the tree-ssa-strlen.c.  This doesn't handle
 	     the case in gcc.dg/strlenopt-8.c which is XFAILed for that