diff mbox

Reenable CSE of non-volatile inline asm (PR rtl-optimization/63637)

Message ID 20150113161819.GD1405@tucnak.redhat.com
State New
Headers show

Commit Message

Jakub Jelinek Jan. 13, 2015, 4:18 p.m. UTC
Hi!

My PR60663 fix unfortunately stopped CSE of all inline-asms, even when
they e.g. only have the clobbers added by default.

This patch attempts to restore the old behavior, with the exceptions:
1) as always, asm volatile is not CSEd
2) inline-asm with multiple outputs are not CSEd
3) on request from Richard (which Segher on IRC argues against), "memory"
   clobber also prevents CSE; this can be removed by removing the
   int j, lim = XVECLEN (x, 0); and loop below it
4) inline-asm with clobbers is never copied into an insn that wasn't
   inline-asm before, so if there are clobbers, we allow CSEing of
   e.g. two same inline-asms, but only by reusing results of one
   of those

Bootstrapped/regtested on x86_64-linux and i686-linux, tested also
with arm cross after reverting the PR60663 arm cost fix.

Ok for trunk this way, or with 3) removed?

2015-01-13  Jakub Jelinek  <jakub@redhat.com>

	PR rtl-optimization/63637
	PR rtl-optimization/60663
	* cse.c (merge_equiv_classes): Set new_elt->cost to MAX_COST
	if elt->cost is MAX_COST for ASM_OPERANDS.
	(find_sets_in_insn): Fix up comment typo.
	(cse_insn): Don't set src_volatile for all non-volatile
	ASM_OPERANDS in PARALLELs, but just those with multiple outputs
	or with "memory" clobber.  Set elt->cost to MAX_COST
	for ASM_OPERANDS in PARALLEL.  Set src_elt->cost to MAX_COST
	if new_src is ASM_OPERANDS and elt->cost is MAX_COST.

	* gcc.dg/pr63637-1.c: New test.
	* gcc.dg/pr63637-2.c: New test.
	* gcc.dg/pr63637-3.c: New test.
	* gcc.dg/pr63637-4.c: New test.
	* gcc.dg/pr63637-5.c: New test.
	* gcc.dg/pr63637-6.c: New test.
	* gcc.target/i386/pr63637-1.c: New test.
	* gcc.target/i386/pr63637-2.c: New test.
	* gcc.target/i386/pr63637-3.c: New test.
	* gcc.target/i386/pr63637-4.c: New test.
	* gcc.target/i386/pr63637-5.c: New test.
	* gcc.target/i386/pr63637-6.c: New test.


	Jakub

Comments

Segher Boessenkool Jan. 13, 2015, 4:38 p.m. UTC | #1
On Tue, Jan 13, 2015 at 05:18:19PM +0100, Jakub Jelinek wrote:
> 3) on request from Richard (which Segher on IRC argues against), "memory"
>    clobber also prevents CSE;

As extend.texi used to say:

"
If your assembler instructions access memory in an unpredictable
fashion, add @samp{memory} to the list of clobbered registers.  This
causes GCC to not keep memory values cached in registers across the
assembler instruction and not optimize stores or loads to that memory.
You also should add the @code{volatile} keyword if the memory
affected is not listed in the inputs or outputs of the @code{asm}, as
the @samp{memory} clobber does not count as a side-effect of the
@code{asm}.
"

so a "memory" clobber in a non-volatile asm should not prevent CSE.


Segher
Jeff Law Jan. 13, 2015, 7:45 p.m. UTC | #2
On 01/13/15 09:38, Segher Boessenkool wrote:
> On Tue, Jan 13, 2015 at 05:18:19PM +0100, Jakub Jelinek wrote:
>> 3) on request from Richard (which Segher on IRC argues against), "memory"
>>     clobber also prevents CSE;
>
> As extend.texi used to say:
>
> "
> If your assembler instructions access memory in an unpredictable
> fashion, add @samp{memory} to the list of clobbered registers.  This
> causes GCC to not keep memory values cached in registers across the
> assembler instruction and not optimize stores or loads to that memory.
> You also should add the @code{volatile} keyword if the memory
> affected is not listed in the inputs or outputs of the @code{asm}, as
> the @samp{memory} clobber does not count as a side-effect of the
> @code{asm}.
> "
>
> so a "memory" clobber in a non-volatile asm should not prevent CSE.
My reading of that paragraph is somewhat different.

The key here is the memory clobber affects optimization of instructions 
around the asm while the volatile specifier affects the optimization of 
the ASM itself.

A memory clobber must inhibit CSE of memory references on either side of 
the asm because the asm must be assumed to read or write memory in 
unpredictable ways.

The volatile specifier tells the compiler that the asm itself must be 
preserved, even if dataflow shows the outputs as not used.

eff
Jakub Jelinek Jan. 13, 2015, 8:13 p.m. UTC | #3
On Tue, Jan 13, 2015 at 12:45:27PM -0700, Jeff Law wrote:
> On 01/13/15 09:38, Segher Boessenkool wrote:
> >On Tue, Jan 13, 2015 at 05:18:19PM +0100, Jakub Jelinek wrote:
> >>3) on request from Richard (which Segher on IRC argues against), "memory"
> >>    clobber also prevents CSE;
> >
> >As extend.texi used to say:
> >
> >"
> >If your assembler instructions access memory in an unpredictable
> >fashion, add @samp{memory} to the list of clobbered registers.  This
> >causes GCC to not keep memory values cached in registers across the
> >assembler instruction and not optimize stores or loads to that memory.
> >You also should add the @code{volatile} keyword if the memory
> >affected is not listed in the inputs or outputs of the @code{asm}, as
> >the @samp{memory} clobber does not count as a side-effect of the
> >@code{asm}.
> >"
> >
> >so a "memory" clobber in a non-volatile asm should not prevent CSE.
> My reading of that paragraph is somewhat different.
> 
> The key here is the memory clobber affects optimization of instructions
> around the asm while the volatile specifier affects the optimization of the
> ASM itself.
> 
> A memory clobber must inhibit CSE of memory references on either side of the
> asm because the asm must be assumed to read or write memory in unpredictable
> ways.
> 
> The volatile specifier tells the compiler that the asm itself must be
> preserved, even if dataflow shows the outputs as not used.

That is not necessarily in conflict.
My reading of Jeff's comment is that in
int a;
int
foo (void)
{
  int b, c, d, e;
  b = a;
  asm ("..." : "=r" (c) : : "memory");
  d = a;
  asm ("..." : "=r" (e) : : "memory");
  return b + d + 2 * (c + e);
}
we are not allowed to CSE d = a; into d = b;.  CSE invalidate_from_clobbers
should ensure that already, even when we don't do anything special about
"memory" clobber in the patch.  Another thing is if there is a store
in between the two non-volatile asms with "memory" clobber, here I'm not
sure if with the alternate patch we'd treat the "memory" clobber as use of
everything previously stored into memory (in this regard the posted version
is safe).
And finally there is the case of non-volatile asm with "memory" clobber with
no memory stores in between the two - the posted (safer) patch will not
allow to CSE the two, while in theory we could CSE them into just one asm.

	Jakub
Jeff Law Jan. 13, 2015, 10:17 p.m. UTC | #4
On 01/13/15 13:13, Jakub Jelinek wrote:
> On Tue, Jan 13, 2015 at 12:45:27PM -0700, Jeff Law wrote:
>> On 01/13/15 09:38, Segher Boessenkool wrote:
>>> On Tue, Jan 13, 2015 at 05:18:19PM +0100, Jakub Jelinek wrote:
>>>> 3) on request from Richard (which Segher on IRC argues against), "memory"
>>>>     clobber also prevents CSE;
>>>
>>> As extend.texi used to say:
>>>
>>> "
>>> If your assembler instructions access memory in an unpredictable
>>> fashion, add @samp{memory} to the list of clobbered registers.  This
>>> causes GCC to not keep memory values cached in registers across the
>>> assembler instruction and not optimize stores or loads to that memory.
>>> You also should add the @code{volatile} keyword if the memory
>>> affected is not listed in the inputs or outputs of the @code{asm}, as
>>> the @samp{memory} clobber does not count as a side-effect of the
>>> @code{asm}.
>>> "
>>>
>>> so a "memory" clobber in a non-volatile asm should not prevent CSE.
>> My reading of that paragraph is somewhat different.
>>
>> The key here is the memory clobber affects optimization of instructions
>> around the asm while the volatile specifier affects the optimization of the
>> ASM itself.
>>
>> A memory clobber must inhibit CSE of memory references on either side of the
>> asm because the asm must be assumed to read or write memory in unpredictable
>> ways.
>>
>> The volatile specifier tells the compiler that the asm itself must be
>> preserved, even if dataflow shows the outputs as not used.
>
> That is not necessarily in conflict.
Possibly not :-)  This stuff isn't trivial and as well meaning as folks 
trying to update the docs have been, it's possible subtle issues like 
these have been missed, even in the review process.

I know that for me it's easier to reason about code changes like this 
than their associated documentation :-)


> My reading of Jeff's comment is that in
> int a;
> int
> foo (void)
> {
>    int b, c, d, e;
>    b = a;
>    asm ("..." : "=r" (c) : : "memory");
>    d = a;
>    asm ("..." : "=r" (e) : : "memory");
>    return b + d + 2 * (c + e);
> }
> we are not allowed to CSE d = a; into d = b;
Precisely.  At least that's how I read things and it makes sense to the 
part of my brain that used to split time between kernel & GCC 
development in a previous life.

In effect the "memory" clobber is an aggregation of the read, write, 
clobbers dataflow for memory (and imprecise as it hits all memory).


.  CSE invalidate_from_clobbers
> should ensure that already, even when we don't do anything special about
> "memory" clobber in the patch.
OK.


   Another thing is if there is a store
> in between the two non-volatile asms with "memory" clobber, here I'm not
> sure if with the alternate patch we'd treat the "memory" clobber as use of
> everything previously stored into memory (in this regard the posted version
> is safe).
I woudln't be terribly surprised if DSE isn't safe in this regard.  I 
don't recall CSE doing any kind of dead store elimination so it wouldn't 
likely care that the memory clobber implies a read as well.



> And finally there is the case of non-volatile asm with "memory" clobber with
> no memory stores in between the two - the posted (safer) patch will not
> allow to CSE the two, while in theory we could CSE them into just one asm.
I think we have to assume that CSEing them is wrong.  The first may set 
something in memory that is read by the second.

Thoughts?


Jeff
Segher Boessenkool Jan. 13, 2015, 10:28 p.m. UTC | #5
On Tue, Jan 13, 2015 at 12:45:27PM -0700, Jeff Law wrote:
> On 01/13/15 09:38, Segher Boessenkool wrote:
> >On Tue, Jan 13, 2015 at 05:18:19PM +0100, Jakub Jelinek wrote:
> >>3) on request from Richard (which Segher on IRC argues against), "memory"
> >>    clobber also prevents CSE;
> >
> >As extend.texi used to say:
> >
> >"
> >If your assembler instructions access memory in an unpredictable
> >fashion, add @samp{memory} to the list of clobbered registers.  This
> >causes GCC to not keep memory values cached in registers across the
> >assembler instruction and not optimize stores or loads to that memory.
> >You also should add the @code{volatile} keyword if the memory
> >affected is not listed in the inputs or outputs of the @code{asm}, as
> >the @samp{memory} clobber does not count as a side-effect of the
> >@code{asm}.
> >"
> >
> >so a "memory" clobber in a non-volatile asm should not prevent CSE.
> My reading of that paragraph is somewhat different.

It seems so.

I read that as "GCC can delete a memory clobber if it wants to" (just
like it can delete any other clobber when it wants to).

The only difference between ASM_OPERANDS and any other RTL is that
recog is useless for ASM_OPERANDS, it cannot tell you if after you
modify the construct you are left with something valid; so the only
thing the compiler can change about an asm is to delete it whole.
So unlike most RTL, where the compiler is free to remove a clobber
if what is left is valid RTL, the only way to delete a clobber from
an asm is to delete the whole asm.

> The key here is the memory clobber affects optimization of instructions 
> around the asm while the volatile specifier affects the optimization of 
> the ASM itself.

Those are roughly the effects, yes.  Writing unspecified stuff to
unspecified memory is a pretty heavy hammer ;-)

> A memory clobber must inhibit CSE of memory references on either side of 
> the asm because the asm must be assumed to read or write memory in 
> unpredictable ways.

I don't see how that follows.  The asm itself can be CSEd; its clobber
then disappears in a puff of smoke.

> The volatile specifier tells the compiler that the asm itself must be 
> preserved, even if dataflow shows the outputs as not used.

Yes.


Segher
Segher Boessenkool Jan. 14, 2015, 12:03 a.m. UTC | #6
On Tue, Jan 13, 2015 at 03:17:08PM -0700, Jeff Law wrote:
> >And finally there is the case of non-volatile asm with "memory" clobber 
> >with
> >no memory stores in between the two - the posted (safer) patch will not
> >allow to CSE the two, while in theory we could CSE them into just one asm.
> I think we have to assume that CSEing them is wrong.  The first may set 
> something in memory that is read by the second.
> 
> Thoughts?

I agree with pretty much everything you say in the thread, except for this
idea that a memory clobber reads memory.  No clobber reads anything.

The commit that introduced the memory clobber concept, 426b38c9 (svn 1207),
by rms, has as only comment

	/* `memory', don't cache memory across asm */


Segher
Jeff Law Jan. 14, 2015, 6:16 a.m. UTC | #7
On 01/13/15 17:03, Segher Boessenkool wrote:
> On Tue, Jan 13, 2015 at 03:17:08PM -0700, Jeff Law wrote:
>>> And finally there is the case of non-volatile asm with "memory" clobber
>>> with
>>> no memory stores in between the two - the posted (safer) patch will not
>>> allow to CSE the two, while in theory we could CSE them into just one asm.
>> I think we have to assume that CSEing them is wrong.  The first may set
>> something in memory that is read by the second.
>>
>> Thoughts?
>
> I agree with pretty much everything you say in the thread, except for this
> idea that a memory clobber reads memory.  No clobber reads anything.
>
> The commit that introduced the memory clobber concept, 426b38c9 (svn 1207),
> by rms, has as only comment
>
> 	/* `memory', don't cache memory across asm */
RMS botched this and you can see it in that the scheduler was not 
updated at the same time.  The scheduler absolutely must track if an ASM 
does a memory read of an arbitrary location.  I'd have to dig deeper to 
see when this got fixed, but it was clearly botched.


Many years later another pass which needs to precisely track such things 
came along, namely DSE.

The code in DSE is actually easier to grok.


First, if you look at the ASM handling in cfgexpand.c you'll find:

             if (j == -4)      /* `memory', don't cache memory across asm */
                 {
                   XVECEXP (body, 0, i++)
                     = gen_rtx_CLOBBER (VOIDmode,
                                        gen_rtx_MEM
                                        (BLKmode,
                                         gen_rtx_SCRATCH (VOIDmode)));
                   continue;
                 }


So we generate (CLOBBER (MEM:BLK (SCRATCH))) when we see "memory" in the 
"clobber" list of an ASM.


If you then look at dse.c we have this in record_store:

  /* At this point we know mem is a mem. */
   if (GET_MODE (mem) == BLKmode)
     {
       if (GET_CODE (XEXP (mem, 0)) == SCRATCH)
         {
           if (dump_file && (dump_flags & TDF_DETAILS))
             fprintf (dump_file, " adding wild read for (clobber 
(mem:BLK (scratch))\n");
           add_wild_read (bb_info);
           insn_info->cannot_delete = true;
           return 0;
         }


Which says very precisely that we treat (CLOBBER (MEM:BLK (SCRATCH))) as 
potentially *reading* any location.

If you trace through how the scheduler builds dependencies, paying 
particular attention to alias.c you'll see that (CLOBBER (MEM:BLK 
(SCRATCH))) is treated as both a read and a write of an arbitrary location.

It's unfortunate that RMS put the "memory" tag in the "clobber" list. 
But he really wasn't a compiler junkie and didn't realize the right 
thing to do was to have a memory tag in both the inputs and 
[output|clobber] section to represent a read of an arbitrary location 
and a write to an arbitrary location independently.  But it is what it 
is at this point and we have to treat "memory" appearing in the 
"clobber" list as an arbitrary memory read and an arbitrary memory write.



Jeff
Jeff Law Jan. 14, 2015, 6:42 a.m. UTC | #8
On 01/13/15 09:18, Jakub Jelinek wrote:
> Hi!
>
> My PR60663 fix unfortunately stopped CSE of all inline-asms, even when
> they e.g. only have the clobbers added by default.
>
> This patch attempts to restore the old behavior, with the exceptions:
> 1) as always, asm volatile is not CSEd
> 2) inline-asm with multiple outputs are not CSEd
> 3) on request from Richard (which Segher on IRC argues against), "memory"
>     clobber also prevents CSE; this can be removed by removing the
>     int j, lim = XVECLEN (x, 0); and loop below it
> 4) inline-asm with clobbers is never copied into an insn that wasn't
>     inline-asm before, so if there are clobbers, we allow CSEing of
>     e.g. two same inline-asms, but only by reusing results of one
>     of those
>
> Bootstrapped/regtested on x86_64-linux and i686-linux, tested also
> with arm cross after reverting the PR60663 arm cost fix.
>
> Ok for trunk this way, or with 3) removed?
>
> 2015-01-13  Jakub Jelinek  <jakub@redhat.com>
>
> 	PR rtl-optimization/63637
> 	PR rtl-optimization/60663
> 	* cse.c (merge_equiv_classes): Set new_elt->cost to MAX_COST
> 	if elt->cost is MAX_COST for ASM_OPERANDS.
> 	(find_sets_in_insn): Fix up comment typo.
> 	(cse_insn): Don't set src_volatile for all non-volatile
> 	ASM_OPERANDS in PARALLELs, but just those with multiple outputs
> 	or with "memory" clobber.  Set elt->cost to MAX_COST
> 	for ASM_OPERANDS in PARALLEL.  Set src_elt->cost to MAX_COST
> 	if new_src is ASM_OPERANDS and elt->cost is MAX_COST.
>
> 	* gcc.dg/pr63637-1.c: New test.
> 	* gcc.dg/pr63637-2.c: New test.
> 	* gcc.dg/pr63637-3.c: New test.
> 	* gcc.dg/pr63637-4.c: New test.
> 	* gcc.dg/pr63637-5.c: New test.
> 	* gcc.dg/pr63637-6.c: New test.
> 	* gcc.target/i386/pr63637-1.c: New test.
> 	* gcc.target/i386/pr63637-2.c: New test.
> 	* gcc.target/i386/pr63637-3.c: New test.
> 	* gcc.target/i386/pr63637-4.c: New test.
> 	* gcc.target/i386/pr63637-5.c: New test.
> 	* gcc.target/i386/pr63637-6.c: New test.
>
> --- gcc/cse.c.jj	2015-01-09 21:59:44.000000000 +0100
> +++ gcc/cse.c	2015-01-13 13:26:23.391216064 +0100
> @@ -1792,6 +1792,8 @@ merge_equiv_classes (struct table_elt *c
>   	    }
>   	  new_elt = insert (exp, class1, hash, mode);
>   	  new_elt->in_memory = hash_arg_in_memory;
> +	  if (GET_CODE (exp) == ASM_OPERANDS && elt->cost == MAX_COST)
> +	    new_elt->cost = MAX_COST;
>   	}
>       }
>   }
> @@ -4258,7 +4260,7 @@ find_sets_in_insn (rtx_insn *insn, struc
>       {
>         int i, lim = XVECLEN (x, 0);
>
> -      /* Go over the epressions of the PARALLEL in forward order, to
> +      /* Go over the expressions of the PARALLEL in forward order, to
>   	 put them in the same order in the SETS array.  */
>         for (i = 0; i < lim; i++)
>   	{
> @@ -4634,12 +4636,27 @@ cse_insn (rtx_insn *insn)
>   	  && REGNO (dest) >= FIRST_PSEUDO_REGISTER)
>   	sets[i].src_volatile = 1;
>
> -      /* Also do not record result of a non-volatile inline asm with
> -	 more than one result or with clobbers, we do not want CSE to
> -	 break the inline asm apart.  */
>         else if (GET_CODE (src) == ASM_OPERANDS
>   	       && GET_CODE (x) == PARALLEL)
> -	sets[i].src_volatile = 1;
> +	{
> +	  /* Do not record result of a non-volatile inline asm with
> +	     more than one result.  */
> +	  if (n_sets > 1)
> +	    sets[i].src_volatile = 1;
> +
> +	  int j, lim = XVECLEN (x, 0);
> +	  for (j = 0; j < lim; j++)
> +	    {
> +	      rtx y = XVECEXP (x, 0, j);
> +	      /* And do not record result of a non-volatile inline asm
> +		 with "memory" clobber.  */
> +	      if (GET_CODE (y) == CLOBBER && MEM_P (XEXP (y, 0)))
Can you please add a comment here which references the full form of the 
"memory" tag.  (clobber (mem:BLK (scratch))).

If we ever have to look at this again (say perhaps to break out the read 
anything vs write anything into separate tags :-) it'll save 
considerable time and angst trying to track all this stuff down.

The tests you've got are a step forward, but there's obviously a lot 
more we could do.  For example testing DSE around ASMs without and 
without a memory "clobber", testing CSE of unrelated memory references 
around an ASM without and without a memory clobber come to mind.   You 
don't have to add them to get approval, but if you were to take the time 
to cobble them together it'd be hugely appreciated.


Given the discussion with Segher, let's give him a chance to chime in on 
tonight's messages before we make a final decision.

jeff
Segher Boessenkool Jan. 14, 2015, 3:19 p.m. UTC | #9
On Tue, Jan 13, 2015 at 11:16:09PM -0700, Jeff Law wrote:
> On 01/13/15 17:03, Segher Boessenkool wrote:
> >On Tue, Jan 13, 2015 at 03:17:08PM -0700, Jeff Law wrote:
> >>>And finally there is the case of non-volatile asm with "memory" clobber
> >>>with
> >>>no memory stores in between the two - the posted (safer) patch will not
> >>>allow to CSE the two, while in theory we could CSE them into just one 
> >>>asm.
> >>I think we have to assume that CSEing them is wrong.  The first may set
> >>something in memory that is read by the second.
> >>
> >>Thoughts?
> >
> >I agree with pretty much everything you say in the thread, except for this
> >idea that a memory clobber reads memory.  No clobber reads anything.
> >
> >The commit that introduced the memory clobber concept, 426b38c9 (svn 1207),
> >by rms, has as only comment
> >
> >	/* `memory', don't cache memory across asm */
> RMS botched this and you can see it in that the scheduler was not 
> updated at the same time.  The scheduler absolutely must track if an ASM 
> does a memory read of an arbitrary location.  I'd have to dig deeper to 
> see when this got fixed, but it was clearly botched.
> 
> 
> Many years later another pass which needs to precisely track such things 
> came along, namely DSE.
> 
> The code in DSE is actually easier to grok.
> 
> 
> First, if you look at the ASM handling in cfgexpand.c you'll find:
> 
>             if (j == -4)      /* `memory', don't cache memory across asm */
>                 {
>                   XVECEXP (body, 0, i++)
>                     = gen_rtx_CLOBBER (VOIDmode,
>                                        gen_rtx_MEM
>                                        (BLKmode,
>                                         gen_rtx_SCRATCH (VOIDmode)));
>                   continue;
>                 }

That is in fact the code I mentioned, the first code that was added :-)
It was moved from stmt.c to cfgexpand.c in 2013.

> So we generate (CLOBBER (MEM:BLK (SCRATCH))) when we see "memory" in the 
> "clobber" list of an ASM.

Right.  And we actually have documentation for that construct (woohoo!),
see rtl.texi "Side Effect Expressions":

"
@findex clobber
@item (clobber @var{x})
Represents the storing or possible storing of an unpredictable,
undescribed value into @var{x}, which must be a @code{reg},
@code{scratch}, @code{parallel} or @code{mem} expression.

[...]

If @var{x} is @code{(mem:BLK (const_int 0))} or
@code{(mem:BLK (scratch))}, it means that all memory
locations must be presumed clobbered.
"

Note it doesn't mention reading memory.


> If you then look at dse.c we have this in record_store:
> 
>  /* At this point we know mem is a mem. */
>   if (GET_MODE (mem) == BLKmode)
>     {
>       if (GET_CODE (XEXP (mem, 0)) == SCRATCH)
>         {
>           if (dump_file && (dump_flags & TDF_DETAILS))
>             fprintf (dump_file, " adding wild read for (clobber 
> (mem:BLK (scratch))\n");
>           add_wild_read (bb_info);
>           insn_info->cannot_delete = true;
>           return 0;
>         }
> 
> 
> Which says very precisely that we treat (CLOBBER (MEM:BLK (SCRATCH))) as 
> potentially *reading* any location.
> 
> If you trace through how the scheduler builds dependencies, paying 
> particular attention to alias.c you'll see that (CLOBBER (MEM:BLK 
> (SCRATCH))) is treated as both a read and a write of an arbitrary location.

Many transforms need to treat it like that, sure, because the clobber
only writes some of the memory.  Since you do not know what memory
is and isn't written, all stores before the clobber have to stay before
the clobber, and that is most easily represented as the insn with the
clobber having a read dependency on all memory.


Now if we go back to my earlier quote:

"
If your assembler instructions access memory in an unpredictable
fashion, add @samp{memory} to the list of clobbered registers.  This
causes GCC to not keep memory values cached in registers across the
assembler instruction and not optimize stores or loads to that memory.
You also should add the @code{volatile} keyword if the memory
affected is not listed in the inputs or outputs of the @code{asm}, as
the @samp{memory} clobber does not count as a side-effect of the
@code{asm}.
"

That last line means the compiler is free to delete a non-volatile
asm with a memory clobber if that asm is not needed for dataflow.  Or
that is how I read it; it is trying to indicate that if you want to
prevent the memory clobber from being deleted (together with the rest
of the asm), you need to make the asm volatile.

So as far as I can see the compiler can CSE two identical non-volatile
asms with memory clobber just fine.  Older GCC (I tried 4.7.2) does do
this; current mainline doesn't.  I think it should.


Segher
Jeff Law Jan. 15, 2015, 5:06 a.m. UTC | #10
On 01/14/15 08:19, Segher Boessenkool wrote:

> "
> @findex clobber
> @item (clobber @var{x})
> Represents the storing or possible storing of an unpredictable,
> undescribed value into @var{x}, which must be a @code{reg},
> @code{scratch}, @code{parallel} or @code{mem} expression.
>
> [...]
>
> If @var{x} is @code{(mem:BLK (const_int 0))} or
> @code{(mem:BLK (scratch))}, it means that all memory
> locations must be presumed clobbered.
> "
>
> Note it doesn't mention reading memory.
The documentation is incomplete.  The right thing to do is fix the 
documentation and treat  the "memory" tag appearing in the "clobber" 
section as a read as well as a write.

It's lame, but the historical decision by RMS to put that tag into the 
clobbers section is what it is.  Don't get too hung up on it.  RMS just 
botched it.

>
> Now if we go back to my earlier quote:
>
> "
> If your assembler instructions access memory in an unpredictable
> fashion, add @samp{memory} to the list of clobbered registers.
Note "access" not "write".


  This
> causes GCC to not keep memory values cached in registers across the
> assembler instruction and not optimize stores or loads to that memory.
> You also should add the @code{volatile} keyword if the memory
> affected is not listed in the inputs or outputs of the @code{asm}, as
> the @samp{memory} clobber does not count as a side-effect of the
> @code{asm}.
> "
>
> That last line means the compiler is free to delete a non-volatile
> asm with a memory clobber if that asm is not needed for dataflow.  Or
> that is how I read it; it is trying to indicate that if you want to
> prevent the memory clobber from being deleted (together with the rest
> of the asm), you need to make the asm volatile.
>
> So as far as I can see the compiler can CSE two identical non-volatile
> asms with memory clobber just fine.  Older GCC (I tried 4.7.2) does do
> this; current mainline doesn't.  I think it should.
No, it should not CSE those two cases.  That's simply wrong and if an 
older version did that optimization, that's a bug.

jeff
Richard Biener Jan. 15, 2015, 6:46 a.m. UTC | #11
On January 15, 2015 6:06:33 AM CET, Jeff Law <law@redhat.com> wrote:
>On 01/14/15 08:19, Segher Boessenkool wrote:
>
>> "
>> @findex clobber
>> @item (clobber @var{x})
>> Represents the storing or possible storing of an unpredictable,
>> undescribed value into @var{x}, which must be a @code{reg},
>> @code{scratch}, @code{parallel} or @code{mem} expression.
>>
>> [...]
>>
>> If @var{x} is @code{(mem:BLK (const_int 0))} or
>> @code{(mem:BLK (scratch))}, it means that all memory
>> locations must be presumed clobbered.
>> "
>>
>> Note it doesn't mention reading memory.
>The documentation is incomplete.  The right thing to do is fix the 
>documentation and treat  the "memory" tag appearing in the "clobber" 
>section as a read as well as a write.
>
>It's lame, but the historical decision by RMS to put that tag into the 
>clobbers section is what it is.  Don't get too hung up on it.  RMS just
>
>botched it.
>
>>
>> Now if we go back to my earlier quote:
>>
>> "
>> If your assembler instructions access memory in an unpredictable
>> fashion, add @samp{memory} to the list of clobbered registers.
>Note "access" not "write".
>
>
>  This
>> causes GCC to not keep memory values cached in registers across the
>> assembler instruction and not optimize stores or loads to that
>memory.
>> You also should add the @code{volatile} keyword if the memory
>> affected is not listed in the inputs or outputs of the @code{asm}, as
>> the @samp{memory} clobber does not count as a side-effect of the
>> @code{asm}.
>> "
>>
>> That last line means the compiler is free to delete a non-volatile
>> asm with a memory clobber if that asm is not needed for dataflow.  Or
>> that is how I read it; it is trying to indicate that if you want to
>> prevent the memory clobber from being deleted (together with the rest
>> of the asm), you need to make the asm volatile.
>>
>> So as far as I can see the compiler can CSE two identical
>non-volatile
>> asms with memory clobber just fine.  Older GCC (I tried 4.7.2) does
>do
>> this; current mainline doesn't.  I think it should.
>No, it should not CSE those two cases.  That's simply wrong and if an 
>older version did that optimization, that's a bug.

I think segher has a point here.  If the asm with memory clobber would store to random memory and the point would be to preserve that then the whole distinction with volatile doesn't make much sense (after all without volatile we happily DCE such asm if the regular outputs are not needed).

This doesn't mean 'memory' is a well-designed thing, of course. Just its effects are effectively limited to reads without volatile(?)

Richard.
>
>jeff
Jakub Jelinek Jan. 15, 2015, 8:13 a.m. UTC | #12
On Thu, Jan 15, 2015 at 07:46:18AM +0100, Richard Biener wrote:
> >> That last line means the compiler is free to delete a non-volatile
> >> asm with a memory clobber if that asm is not needed for dataflow.  Or
> >> that is how I read it; it is trying to indicate that if you want to
> >> prevent the memory clobber from being deleted (together with the rest
> >> of the asm), you need to make the asm volatile.
> >>
> >> So as far as I can see the compiler can CSE two identical
> >non-volatile
> >> asms with memory clobber just fine.  Older GCC (I tried 4.7.2) does
> >do
> >> this; current mainline doesn't.  I think it should.
> >No, it should not CSE those two cases.  That's simply wrong and if an 
> >older version did that optimization, that's a bug.
> 
> I think segher has a point here.  If the asm with memory clobber would store to random memory and the point would be to preserve that then the whole distinction with volatile doesn't make much sense (after all without volatile we happily DCE such asm if the regular outputs are not needed).
> 
> This doesn't mean 'memory' is a well-designed thing, of course. Just its
> effects are effectively limited to reads without volatile(?)

Segher's mails talk about "memory" being a write but not read.
If we even can't agree on what non-volatile "memory" means, I think
we should treat it more conservatively, because every user (and there are
lots of people using non-volatile asm with "memory" in the wild) treats it
differently.  Just trying to grep for a few:
glibc:
./sysdeps/alpha/bits/atomic.h:# define atomic_full_barrier()	__asm ("mb" : : : "memory");
./sysdeps/alpha/bits/atomic.h:# define atomic_read_barrier()	__asm ("mb" : : : "memory");
./sysdeps/alpha/bits/atomic.h:# define atomic_write_barrier()	__asm ("wmb" : : : "memory");
./sysdeps/sparc/sparc32/bits/atomic.h:# define atomic_full_barrier() __asm ("" ::: "memory")
./sysdeps/powerpc/powerpc32/bits/atomic.h:# define atomic_read_barrier()	__asm ("lwsync" ::: "memory")
./sysdeps/powerpc/powerpc32/bits/atomic.h:# define atomic_read_barrier()	__asm ("sync" ::: "memory")
./sysdeps/powerpc/powerpc64/bits/atomic.h:#define atomic_read_barrier()	__asm ("lwsync" ::: "memory")
./sysdeps/powerpc/bits/atomic.h:#define atomic_full_barrier()	__asm ("sync" ::: "memory")
./sysdeps/powerpc/bits/atomic.h:#define atomic_write_barrier()	__asm ("eieio" ::: "memory")
./sysdeps/generic/malloc-machine.h:# define atomic_full_barrier() __asm ("" ::: "memory")
./elf/tst-tlsmod3.c:  asm ("" ::: "memory");
./elf/tst-tlsmod4.c:  asm ("" ::: "memory");
./elf/tst-tlsmod1.c:  asm ("" ::: "memory");
./elf/tst-tlsmod2.c:  asm ("" ::: "memory");
./include/atomic.h:# define atomic_full_barrier() __asm ("" ::: "memory")
linux kernel:
./arch/arm/mach-omap2/pm24xx.c:	asm("mcr p15, 0, %0, c7, c0, 4" : : "r" (zero) : "memory", "cc");
./arch/arm/include/asm/irqflags.h:#define local_fiq_enable()  __asm__("cpsie f	@ __stf" : : : "memory", "cc")
./arch/arm/include/asm/irqflags.h:#define local_fiq_disable() __asm__("cpsid f	@ __clf" : : : "memory", "cc")
./arch/x86/include/asm/uaccess_64.h:		asm("":::"memory");
./arch/x86/include/asm/uaccess_64.h:		asm("":::"memory");
./arch/x86/include/asm/segment.h:	asm("mov %%" #seg ",%0":"=r" (value) : : "memory")
./arch/x86/include/asm/stackprotector.h:	asm("mov %0, %%gs" : : "r" (__KERNEL_STACK_CANARY) : "memory");
./arch/arm64/include/asm/irqflags.h:#define local_fiq_enable()	asm("msr	daifclr, #1" : : : "memory")
./arch/arm64/include/asm/irqflags.h:#define local_fiq_disable()	asm("msr	daifset, #1" : : : "memory")
./arch/arm64/include/asm/irqflags.h:#define local_async_enable()	asm("msr	daifclr, #4" : : : "memory")
./arch/arm64/include/asm/irqflags.h:#define local_async_disable()	asm("msr	daifset, #4" : : : "memory")
./arch/tile/lib/memcpy_64.c:				__asm__ ("" : : : "memory");
./arch/tile/lib/memcpy_64.c:				__asm__ ("" : : : "memory");
./arch/tile/include/hv/netio_intf.h:  __asm__("" : : : "memory");
./drivers/net/ethernet/sgi/ioc3-eth.c:	__asm__("sync" ::: "memory")
./lib/sha1.c:  #define setW(x, val) do { W(x) = (val); __asm__("":::"memory"); } while (0)

The glibc barriers are supposedly something that can be CSEd (one barrier instead of
two consecutive barriers is enough), but certainly not moved across any loads/stores
in between.  In the kernel case, the enable/disable probably wouldn't allow even CSE.

So I'm with Jeff that we should treat "memory" at least as unspecified read and write,
and whether we can CSE them if there are no memory loads/stores in between them can
be discussed (most likely the kernel would be safe even in that case, because those
usually don't nest and appear in pairs, or act as barriers (like the glibc case),
or read from segment registers (guess again ok to be CSEd with no intervening loads/stores).

In 4.9 backport I'd prefer not to CSE them at all though, stay conservative.

	Jakub
Richard Biener Jan. 15, 2015, 8:22 a.m. UTC | #13
On Thu, 15 Jan 2015, Jakub Jelinek wrote:

> On Thu, Jan 15, 2015 at 07:46:18AM +0100, Richard Biener wrote:
> > >> That last line means the compiler is free to delete a non-volatile
> > >> asm with a memory clobber if that asm is not needed for dataflow.  Or
> > >> that is how I read it; it is trying to indicate that if you want to
> > >> prevent the memory clobber from being deleted (together with the rest
> > >> of the asm), you need to make the asm volatile.
> > >>
> > >> So as far as I can see the compiler can CSE two identical
> > >non-volatile
> > >> asms with memory clobber just fine.  Older GCC (I tried 4.7.2) does
> > >do
> > >> this; current mainline doesn't.  I think it should.
> > >No, it should not CSE those two cases.  That's simply wrong and if an 
> > >older version did that optimization, that's a bug.
> > 
> > I think segher has a point here.  If the asm with memory clobber would store to random memory and the point would be to preserve that then the whole distinction with volatile doesn't make much sense (after all without volatile we happily DCE such asm if the regular outputs are not needed).
> > 
> > This doesn't mean 'memory' is a well-designed thing, of course. Just its
> > effects are effectively limited to reads without volatile(?)
> 
> Segher's mails talk about "memory" being a write but not read.
> If we even can't agree on what non-volatile "memory" means, I think
> we should treat it more conservatively, because every user (and there are
> lots of people using non-volatile asm with "memory" in the wild) treats it
> differently.  Just trying to grep for a few:
> glibc:
> ./sysdeps/alpha/bits/atomic.h:# define atomic_full_barrier()	__asm ("mb" : : : "memory");
> ./sysdeps/alpha/bits/atomic.h:# define atomic_read_barrier()	__asm ("mb" : : : "memory");
> ./sysdeps/alpha/bits/atomic.h:# define atomic_write_barrier()	__asm ("wmb" : : : "memory");
> ./sysdeps/sparc/sparc32/bits/atomic.h:# define atomic_full_barrier() __asm ("" ::: "memory")
> ./sysdeps/powerpc/powerpc32/bits/atomic.h:# define atomic_read_barrier()	__asm ("lwsync" ::: "memory")
> ./sysdeps/powerpc/powerpc32/bits/atomic.h:# define atomic_read_barrier()	__asm ("sync" ::: "memory")
> ./sysdeps/powerpc/powerpc64/bits/atomic.h:#define atomic_read_barrier()	__asm ("lwsync" ::: "memory")
> ./sysdeps/powerpc/bits/atomic.h:#define atomic_full_barrier()	__asm ("sync" ::: "memory")
> ./sysdeps/powerpc/bits/atomic.h:#define atomic_write_barrier()	__asm ("eieio" ::: "memory")
> ./sysdeps/generic/malloc-machine.h:# define atomic_full_barrier() __asm ("" ::: "memory")
> ./elf/tst-tlsmod3.c:  asm ("" ::: "memory");
> ./elf/tst-tlsmod4.c:  asm ("" ::: "memory");
> ./elf/tst-tlsmod1.c:  asm ("" ::: "memory");
> ./elf/tst-tlsmod2.c:  asm ("" ::: "memory");
> ./include/atomic.h:# define atomic_full_barrier() __asm ("" ::: "memory")
> linux kernel:
> ./arch/arm/mach-omap2/pm24xx.c:	asm("mcr p15, 0, %0, c7, c0, 4" : : "r" (zero) : "memory", "cc");
> ./arch/arm/include/asm/irqflags.h:#define local_fiq_enable()  __asm__("cpsie f	@ __stf" : : : "memory", "cc")
> ./arch/arm/include/asm/irqflags.h:#define local_fiq_disable() __asm__("cpsid f	@ __clf" : : : "memory", "cc")
> ./arch/x86/include/asm/uaccess_64.h:		asm("":::"memory");
> ./arch/x86/include/asm/uaccess_64.h:		asm("":::"memory");
> ./arch/x86/include/asm/segment.h:	asm("mov %%" #seg ",%0":"=r" (value) : : "memory")
> ./arch/x86/include/asm/stackprotector.h:	asm("mov %0, %%gs" : : "r" (__KERNEL_STACK_CANARY) : "memory");
> ./arch/arm64/include/asm/irqflags.h:#define local_fiq_enable()	asm("msr	daifclr, #1" : : : "memory")
> ./arch/arm64/include/asm/irqflags.h:#define local_fiq_disable()	asm("msr	daifset, #1" : : : "memory")
> ./arch/arm64/include/asm/irqflags.h:#define local_async_enable()	asm("msr	daifclr, #4" : : : "memory")
> ./arch/arm64/include/asm/irqflags.h:#define local_async_disable()	asm("msr	daifset, #4" : : : "memory")
> ./arch/tile/lib/memcpy_64.c:				__asm__ ("" : : : "memory");
> ./arch/tile/lib/memcpy_64.c:				__asm__ ("" : : : "memory");
> ./arch/tile/include/hv/netio_intf.h:  __asm__("" : : : "memory");
> ./drivers/net/ethernet/sgi/ioc3-eth.c:	__asm__("sync" ::: "memory")
> ./lib/sha1.c:  #define setW(x, val) do { W(x) = (val); __asm__("":::"memory"); } while (0)
> 
> The glibc barriers are supposedly something that can be CSEd (one barrier instead of
> two consecutive barriers is enough), but certainly not moved across any loads/stores
> in between.  In the kernel case, the enable/disable probably wouldn't allow even CSE.
> 
> So I'm with Jeff that we should treat "memory" at least as unspecified read and write,
> and whether we can CSE them if there are no memory loads/stores in between them can
> be discussed (most likely the kernel would be safe even in that case, because those
> usually don't nest and appear in pairs, or act as barriers (like the glibc case),
> or read from segment registers (guess again ok to be CSEd with no intervening loads/stores).
> 
> In 4.9 backport I'd prefer not to CSE them at all though, stay conservative.

Sure - I also requested "memory" to be not CSEd just to be conservative,
not because I fully understood its meaning.

Richard.
Jakub Jelinek Jan. 15, 2015, 8:43 a.m. UTC | #14
On Thu, Jan 15, 2015 at 09:13:30AM +0100, Jakub Jelinek wrote:
> differently.  Just trying to grep for a few:
> glibc:
> ./sysdeps/alpha/bits/atomic.h:# define atomic_full_barrier()	__asm ("mb" : : : "memory");
> ./sysdeps/alpha/bits/atomic.h:# define atomic_read_barrier()	__asm ("mb" : : : "memory");
> ./sysdeps/alpha/bits/atomic.h:# define atomic_write_barrier()	__asm ("wmb" : : : "memory");
> ./sysdeps/sparc/sparc32/bits/atomic.h:# define atomic_full_barrier() __asm ("" ::: "memory")
> ./sysdeps/powerpc/powerpc32/bits/atomic.h:# define atomic_read_barrier()	__asm ("lwsync" ::: "memory")
> ./sysdeps/powerpc/powerpc32/bits/atomic.h:# define atomic_read_barrier()	__asm ("sync" ::: "memory")
> ./sysdeps/powerpc/powerpc64/bits/atomic.h:#define atomic_read_barrier()	__asm ("lwsync" ::: "memory")
> ./sysdeps/powerpc/bits/atomic.h:#define atomic_full_barrier()	__asm ("sync" ::: "memory")
> ./sysdeps/powerpc/bits/atomic.h:#define atomic_write_barrier()	__asm ("eieio" ::: "memory")
> ./sysdeps/generic/malloc-machine.h:# define atomic_full_barrier() __asm ("" ::: "memory")
> ./elf/tst-tlsmod3.c:  asm ("" ::: "memory");
> ./elf/tst-tlsmod4.c:  asm ("" ::: "memory");
> ./elf/tst-tlsmod1.c:  asm ("" ::: "memory");
> ./elf/tst-tlsmod2.c:  asm ("" ::: "memory");
> ./include/atomic.h:# define atomic_full_barrier() __asm ("" ::: "memory")
> linux kernel:
> ./arch/arm/mach-omap2/pm24xx.c:	asm("mcr p15, 0, %0, c7, c0, 4" : : "r" (zero) : "memory", "cc");
> ./arch/arm/include/asm/irqflags.h:#define local_fiq_enable()  __asm__("cpsie f	@ __stf" : : : "memory", "cc")
> ./arch/arm/include/asm/irqflags.h:#define local_fiq_disable() __asm__("cpsid f	@ __clf" : : : "memory", "cc")
> ./arch/x86/include/asm/uaccess_64.h:		asm("":::"memory");
> ./arch/x86/include/asm/uaccess_64.h:		asm("":::"memory");
> ./arch/x86/include/asm/segment.h:	asm("mov %%" #seg ",%0":"=r" (value) : : "memory")
> ./arch/x86/include/asm/stackprotector.h:	asm("mov %0, %%gs" : : "r" (__KERNEL_STACK_CANARY) : "memory");
> ./arch/arm64/include/asm/irqflags.h:#define local_fiq_enable()	asm("msr	daifclr, #1" : : : "memory")
> ./arch/arm64/include/asm/irqflags.h:#define local_fiq_disable()	asm("msr	daifset, #1" : : : "memory")
> ./arch/arm64/include/asm/irqflags.h:#define local_async_enable()	asm("msr	daifclr, #4" : : : "memory")
> ./arch/arm64/include/asm/irqflags.h:#define local_async_disable()	asm("msr	daifset, #4" : : : "memory")
> ./arch/tile/lib/memcpy_64.c:				__asm__ ("" : : : "memory");
> ./arch/tile/lib/memcpy_64.c:				__asm__ ("" : : : "memory");
> ./arch/tile/include/hv/netio_intf.h:  __asm__("" : : : "memory");
> ./drivers/net/ethernet/sgi/ioc3-eth.c:	__asm__("sync" ::: "memory")
> ./lib/sha1.c:  #define setW(x, val) do { W(x) = (val); __asm__("":::"memory"); } while (0)

Oops, bad grep, only the 
./arch/x86/include/asm/segment.h:     asm("mov %%" #seg ",%0":"=r" (value) : : "memory")
is what we are talking about actually, asm without outputs is implicitly
volatile.

	Jakub
Jeff Law Jan. 15, 2015, 6:02 p.m. UTC | #15
On 01/15/15 01:13, Jakub Jelinek wrote:
>
> The glibc barriers are supposedly something that can be CSEd (one barrier instead of
> two consecutive barriers is enough), but certainly not moved across any loads/stores
> in between.  In the kernel case, the enable/disable probably wouldn't allow even CSE.
>
> So I'm with Jeff that we should treat "memory" at least as unspecified read and write,
> and whether we can CSE them if there are no memory loads/stores in between them can
> be discussed (most likely the kernel would be safe even in that case, because those
> usually don't nest and appear in pairs, or act as barriers (like the glibc case),
> or read from segment registers (guess again ok to be CSEd with no intervening loads/stores).
>
> In 4.9 backport I'd prefer not to CSE them at all though, stay conservative.
My vote would be to go conservative.  For gcc6 consider allowing a 
"memory" tag in the inputs and outputs to specify a read of any memory 
location and write of any memory location respectively.  A "memory" tag 
in the clobbers would maintain the conservative behaviour.

jeff
Richard Henderson Jan. 23, 2015, 8:52 p.m. UTC | #16
On 01/15/2015 12:13 AM, Jakub Jelinek wrote:
> On Thu, Jan 15, 2015 at 07:46:18AM +0100, Richard Biener wrote:
>>>> That last line means the compiler is free to delete a non-volatile
>>>> asm with a memory clobber if that asm is not needed for dataflow.  Or
>>>> that is how I read it; it is trying to indicate that if you want to
>>>> prevent the memory clobber from being deleted (together with the rest
>>>> of the asm), you need to make the asm volatile.
>>>>
>>>> So as far as I can see the compiler can CSE two identical
>>> non-volatile
>>>> asms with memory clobber just fine.  Older GCC (I tried 4.7.2) does
>>> do
>>>> this; current mainline doesn't.  I think it should.
>>> No, it should not CSE those two cases.  That's simply wrong and if an 
>>> older version did that optimization, that's a bug.
>>
>> I think segher has a point here.  If the asm with memory clobber would store to random memory and the point would be to preserve that then the whole distinction with volatile doesn't make much sense (after all without volatile we happily DCE such asm if the regular outputs are not needed).
>>
>> This doesn't mean 'memory' is a well-designed thing, of course. Just its
>> effects are effectively limited to reads without volatile(?)
> 
> Segher's mails talk about "memory" being a write but not read.
> If we even can't agree on what non-volatile "memory" means, I think
> we should treat it more conservatively, because every user (and there are
> lots of people using non-volatile asm with "memory" in the wild) treats it
> differently.  Just trying to grep for a few:
> glibc:
> ./sysdeps/alpha/bits/atomic.h:# define atomic_full_barrier()	__asm ("mb" : : : "memory");
> ./sysdeps/alpha/bits/atomic.h:# define atomic_read_barrier()	__asm ("mb" : : : "memory");
> ./sysdeps/alpha/bits/atomic.h:# define atomic_write_barrier()	__asm ("wmb" : : : "memory");
> ./sysdeps/sparc/sparc32/bits/atomic.h:# define atomic_full_barrier() __asm ("" ::: "memory")
> ./sysdeps/powerpc/powerpc32/bits/atomic.h:# define atomic_read_barrier()	__asm ("lwsync" ::: "memory")
> ./sysdeps/powerpc/powerpc32/bits/atomic.h:# define atomic_read_barrier()	__asm ("sync" ::: "memory")
> ./sysdeps/powerpc/powerpc64/bits/atomic.h:#define atomic_read_barrier()	__asm ("lwsync" ::: "memory")
> ./sysdeps/powerpc/bits/atomic.h:#define atomic_full_barrier()	__asm ("sync" ::: "memory")
> ./sysdeps/powerpc/bits/atomic.h:#define atomic_write_barrier()	__asm ("eieio" ::: "memory")
> ./sysdeps/generic/malloc-machine.h:# define atomic_full_barrier() __asm ("" ::: "memory")

I think that it's uses like these -- which may well have been written
by folks that also work on gcc -- that are proof that we have at least
intended to support a memory clobber to be a full read+write barrier,
and thus we must consider a memory clobber to be both a read and write.

(The fact that all of these are automatically volatile and would never be CSEd
is beside the point.  If the semantics of a memory clobber differ based on the
volatile flag on the asm, I think that would be too ill-defined to actually
support.)

In the interest of progressing wrt the current regression, I think that
Jakub's patch should go in as-is for now, and then we iterate on how we
think the memory cse ought (or ought not) to occur.

As for my own thoughts on whether two non-volatile asms with memory clobbers
should be CSE'd, in absence of other stores to memory in between, are
complicated and probably not well-formed.  I'll think about it some more.


r~
Segher Boessenkool Jan. 23, 2015, 9:39 p.m. UTC | #17
On Fri, Jan 23, 2015 at 12:52:37PM -0800, Richard Henderson wrote:
> > ./sysdeps/generic/malloc-machine.h:# define atomic_full_barrier() __asm ("" ::: "memory")
> 
> I think that it's uses like these -- which may well have been written
> by folks that also work on gcc -- that are proof that we have at least
> intended to support a memory clobber to be a full read+write barrier,
> and thus we must consider a memory clobber to be both a read and write.

I understand that argument.  But it is not what GCC actually does, nor
what I think it should do.  Consider this program:

--- 8< ---
int main(void)
{
	int x[100], y[100];

	x[31] = 42;

	asm("# eww %0" : "=m"(y[4]) : : "memory");

	return 0;
}
--- 8< ---

If "memory" would mean a read, the store to x[31] should not be considered
dead.  But it is, by all versions of GCC I have tried.

Because "memory" means a clobber of unspecified memory, all accesses *that
do happen* before it stay before it, and all accesses after it, after it.
It conflicts with all other memory accesses.  But it is a clobber, not an
output nor an input.

Here is another program:

--- 8< ---
int main(void)
{
	int x;

	asm("# eww %0" : "=r"(x) : : "memory");
	asm("# eww %0" : "=r"(x) : : "memory");

	return x;
}
--- 8< ---

If "memory" would imply a write and a read, the identical asm here could
not be CSEd.  But it is.

> (The fact that all of these are automatically volatile and would never be CSEd
> is beside the point.  If the semantics of a memory clobber differ based on the
> volatile flag on the asm, I think that would be too ill-defined to actually
> support.)

The semantics of the memory clobber do not change.  The semantics of the
asm as a whole do though: any volatile has to be executed on the real machine
as on the abstract machine.


You could argue (and it seems you do) that we should change the current
semantics of the "memory" clobber to do imply reading any memory; I argue
that that would be a bad idea, see the first example code.


Segher
Jakub Jelinek Jan. 23, 2015, 9:48 p.m. UTC | #18
On Fri, Jan 23, 2015 at 03:39:40PM -0600, Segher Boessenkool wrote:
> I understand that argument.  But it is not what GCC actually does, nor
> what I think it should do.  Consider this program:
> 
> --- 8< ---
> int main(void)
> {
> 	int x[100], y[100];
> 
> 	x[31] = 42;
> 
> 	asm("# eww %0" : "=m"(y[4]) : : "memory");
> 
> 	return 0;
> }
> --- 8< ---

Here x isn't addressable, so it is certainly fine to DSE it.
x shouldn't be considered memory.
If the address of x escaped, either to the assembly or to some global var
etc., then it probably shouldn't be removed.

> Here is another program:
> 
> --- 8< ---
> int main(void)
> {
> 	int x;
> 
> 	asm("# eww %0" : "=r"(x) : : "memory");
> 	asm("# eww %0" : "=r"(x) : : "memory");
> 
> 	return x;
> }
> --- 8< ---
> 
> If "memory" would imply a write and a read, the identical asm here could
> not be CSEd.  But it is.

Certainly not in GCC 4.9 nor trunk.  I've committed the patch because it
makes GCC more aggressive again, just not for the "memory" case.
In case of two idential non-volatile asms with "memory" clobber,
if there are no intervening memory reads or writes, we can talk about
allowing that to be CSEd despite "memory" being considered unspecified
read and write.  If there are stores in between, we certainly should not CSE.

	Jakub
Segher Boessenkool Jan. 24, 2015, 1:18 a.m. UTC | #19
On Fri, Jan 23, 2015 at 10:48:50PM +0100, Jakub Jelinek wrote:
> On Fri, Jan 23, 2015 at 03:39:40PM -0600, Segher Boessenkool wrote:
> > I understand that argument.  But it is not what GCC actually does, nor
> > what I think it should do.  Consider this program:
> > 
> > --- 8< ---
> > int main(void)
> > {
> > 	int x[100], y[100];
> > 
> > 	x[31] = 42;
> > 
> > 	asm("# eww %0" : "=m"(y[4]) : : "memory");
> > 
> > 	return 0;
> > }
> > --- 8< ---
> 
> Here x isn't addressable, so it is certainly fine to DSE it.
> x shouldn't be considered memory.
> If the address of x escaped, either to the assembly or to some global var
> etc., then it probably shouldn't be removed.

But GCC does consider it memory.  If you look at the (tree) dump files
you see both arrays are clobbered after the asm.  Tree DCE removes the
store to x[31] nevertheless.

If the address of x escapes then of course the store to x[31] should
not be removed, irrespective of whether the clobber implies a read
or not.

> > Here is another program:
> > 
> > --- 8< ---
> > int main(void)
> > {
> > 	int x;
> > 
> > 	asm("# eww %0" : "=r"(x) : : "memory");
> > 	asm("# eww %0" : "=r"(x) : : "memory");
> > 
> > 	return x;
> > }
> > --- 8< ---
> > 
> > If "memory" would imply a write and a read, the identical asm here could
> > not be CSEd.  But it is.
> 
> Certainly not in GCC 4.9 nor trunk.  I've committed the patch because it
> makes GCC more aggressive again, just not for the "memory" case.

Yes, sorry, it is not actually removed by the CSE pass, but much much earlier
("deleted 2 trivially dead insns" in the early RTL "jump" pass).  My point
is that much of the compiler does not agree that "memory" implies a read.

But, to show RTL CSE removes the redundant insn here with older compilers:

--- 8< ---
int main(void)
{
	int x1, x2;

	asm("# eww %0" : "=r"(x1) : : "memory");
	asm("# eww %0" : "=r"(x2) : : "memory");

	return x1 + x2;
}
--- 8< ---

You can argue CSE should not do that.  But other passes feel free to
remove one of the asms (in the previous program), and they should not
either if "memory" implies a read and a write.

> In case of two idential non-volatile asms with "memory" clobber,
> if there are no intervening memory reads or writes, we can talk about
> allowing that to be CSEd despite "memory" being considered unspecified
> read and write.  If there are stores in between, we certainly should not CSE.

If there is a store inbetween, we cannot remove the later asm, which is
what CSE would do AFAIK.  But we can remove the *earlier* asm (and, in
fact, GCC does do that).


Segher
Richard Sandiford Jan. 24, 2015, 11:53 a.m. UTC | #20
Segher Boessenkool <segher@kernel.crashing.org> writes:
> On Fri, Jan 23, 2015 at 10:48:50PM +0100, Jakub Jelinek wrote:
>> On Fri, Jan 23, 2015 at 03:39:40PM -0600, Segher Boessenkool wrote:
>> > I understand that argument.  But it is not what GCC actually does, nor
>> > what I think it should do.  Consider this program:
>> > 
>> > --- 8< ---
>> > int main(void)
>> > {
>> > 	int x[100], y[100];
>> > 
>> > 	x[31] = 42;
>> > 
>> > 	asm("# eww %0" : "=m"(y[4]) : : "memory");
>> > 
>> > 	return 0;
>> > }
>> > --- 8< ---
>> 
>> Here x isn't addressable, so it is certainly fine to DSE it.
>> x shouldn't be considered memory.
>> If the address of x escaped, either to the assembly or to some global var
>> etc., then it probably shouldn't be removed.
>
> But GCC does consider it memory.  If you look at the (tree) dump files
> you see both arrays are clobbered after the asm.  Tree DCE removes the
> store to x[31] nevertheless.
>
> If the address of x escapes then of course the store to x[31] should
> not be removed, irrespective of whether the clobber implies a read
> or not.

Just tried some other examples out of curiosity.  In:

int main(void)
{
  int x[100], y[100];

  asm volatile("# foo" :: "r"(x));
  x[31] = 42;
  asm("# eww %0" : "=m"(y[4]) : : "memory");

  return 0;
}

"x[31]" can only validly escape to the second asm.  In this case the
assignment is kept, as it is with:

int main(void)
{
  int x[100], y;

  asm volatile("# foo" :: "r"(x));
  x[31] = 42;
  asm("# eww %0" : "=r"(y) : : "memory");

  return y;
}

But remove the clobber and it goes away:

int main(void)
{
  int x[100], y;

  asm volatile("# foo" :: "r"(x));
  x[31] = 42;
  asm("# eww %0" : "=r"(y));

  return y;
}

So it looks like these four cases (including yours) are handled correctly.

Thanks,
Richard
diff mbox

Patch

--- gcc/cse.c.jj	2015-01-09 21:59:44.000000000 +0100
+++ gcc/cse.c	2015-01-13 13:26:23.391216064 +0100
@@ -1792,6 +1792,8 @@  merge_equiv_classes (struct table_elt *c
 	    }
 	  new_elt = insert (exp, class1, hash, mode);
 	  new_elt->in_memory = hash_arg_in_memory;
+	  if (GET_CODE (exp) == ASM_OPERANDS && elt->cost == MAX_COST)
+	    new_elt->cost = MAX_COST;
 	}
     }
 }
@@ -4258,7 +4260,7 @@  find_sets_in_insn (rtx_insn *insn, struc
     {
       int i, lim = XVECLEN (x, 0);
 
-      /* Go over the epressions of the PARALLEL in forward order, to
+      /* Go over the expressions of the PARALLEL in forward order, to
 	 put them in the same order in the SETS array.  */
       for (i = 0; i < lim; i++)
 	{
@@ -4634,12 +4636,27 @@  cse_insn (rtx_insn *insn)
 	  && REGNO (dest) >= FIRST_PSEUDO_REGISTER)
 	sets[i].src_volatile = 1;
 
-      /* Also do not record result of a non-volatile inline asm with
-	 more than one result or with clobbers, we do not want CSE to
-	 break the inline asm apart.  */
       else if (GET_CODE (src) == ASM_OPERANDS
 	       && GET_CODE (x) == PARALLEL)
-	sets[i].src_volatile = 1;
+	{
+	  /* Do not record result of a non-volatile inline asm with
+	     more than one result.  */
+	  if (n_sets > 1)
+	    sets[i].src_volatile = 1;
+
+	  int j, lim = XVECLEN (x, 0);
+	  for (j = 0; j < lim; j++)
+	    {
+	      rtx y = XVECEXP (x, 0, j);
+	      /* And do not record result of a non-volatile inline asm
+		 with "memory" clobber.  */
+	      if (GET_CODE (y) == CLOBBER && MEM_P (XEXP (y, 0)))
+		{
+		  sets[i].src_volatile = 1;
+		  break;
+		}
+	    }
+	}
 
 #if 0
       /* It is no longer clear why we used to do this, but it doesn't
@@ -5230,8 +5247,8 @@  cse_insn (rtx_insn *insn)
 	    ;
 
 	  /* Look for a substitution that makes a valid insn.  */
-	  else if (validate_unshare_change
-		     (insn, &SET_SRC (sets[i].rtl), trial, 0))
+	  else if (validate_unshare_change (insn, &SET_SRC (sets[i].rtl),
+					    trial, 0))
 	    {
 	      rtx new_rtx = canon_reg (SET_SRC (sets[i].rtl), insn);
 
@@ -5593,6 +5610,12 @@  cse_insn (rtx_insn *insn)
 		  }
 		elt = insert (src, classp, sets[i].src_hash, mode);
 		elt->in_memory = sets[i].src_in_memory;
+		/* If inline asm has any clobbers, ensure we only reuse
+		   existing inline asms and never try to put the ASM_OPERANDS
+		   into an insn that isn't inline asm.  */
+		if (GET_CODE (src) == ASM_OPERANDS
+		    && GET_CODE (x) == PARALLEL)
+		  elt->cost = MAX_COST;
 		sets[i].src_elt = classp = elt;
 	      }
 	    if (sets[i].src_const && sets[i].src_const_elt == 0
@@ -5906,6 +5929,9 @@  cse_insn (rtx_insn *insn)
 		      }
 		    src_elt = insert (new_src, classp, src_hash, new_mode);
 		    src_elt->in_memory = elt->in_memory;
+		    if (GET_CODE (new_src) == ASM_OPERANDS
+			&& elt->cost == MAX_COST)
+		      src_elt->cost = MAX_COST;
 		  }
 		else if (classp && classp != src_elt->first_same_value)
 		  /* Show that two things that we've seen before are
--- gcc/testsuite/gcc.dg/pr63637-1.c.jj	2015-01-13 13:40:56.385782037 +0100
+++ gcc/testsuite/gcc.dg/pr63637-1.c	2015-01-13 13:41:08.931559978 +0100
@@ -0,0 +1,15 @@ 
+/* PR rtl-optimization/63637 */
+/* { dg-do compile } */
+/* { dg-options "-O2" } */
+
+int
+foo (void)
+{
+  int a, b, c;
+  asm ("# Magic instruction" : "=r" (a));
+  asm ("# Magic instruction" : "=r" (b));
+  asm ("# Magic instruction" : "=r" (c));
+  return a + b + c;
+}
+
+/* { dg-final { scan-assembler-times "# Magic instruction" 1 } } */
--- gcc/testsuite/gcc.dg/pr63637-2.c.jj	2015-01-13 13:41:36.967063752 +0100
+++ gcc/testsuite/gcc.dg/pr63637-2.c	2015-01-13 13:42:04.758571844 +0100
@@ -0,0 +1,15 @@ 
+/* PR rtl-optimization/63637 */
+/* { dg-do compile } */
+/* { dg-options "-O2" } */
+
+int
+foo (void)
+{
+  int a, b, c;
+  asm ("# Magic instruction" : "=r" (a) : "r" (0));
+  asm ("# Magic instruction" : "=r" (b) : "r" (0));
+  asm ("# Magic instruction" : "=r" (c) : "r" (0));
+  return a + b + c;
+}
+
+/* { dg-final { scan-assembler-times "# Magic instruction" 1 } } */
--- gcc/testsuite/gcc.dg/pr63637-3.c.jj	2015-01-13 13:43:58.820552956 +0100
+++ gcc/testsuite/gcc.dg/pr63637-3.c	2015-01-13 13:44:21.702147954 +0100
@@ -0,0 +1,15 @@ 
+/* PR rtl-optimization/63637 */
+/* { dg-do compile } */
+/* { dg-options "-O2" } */
+
+int
+foo (void)
+{
+  int a, b, c;
+  asm ("# Magic instruction" : "=r" (a) : : "memory");
+  asm ("# Magic instruction" : "=r" (b) : : "memory");
+  asm ("# Magic instruction" : "=r" (c) : : "memory");
+  return a + b + c;
+}
+
+/* { dg-final { scan-assembler-times "# Magic instruction" 3 } } */
--- gcc/testsuite/gcc.dg/pr63637-4.c.jj	2015-01-13 13:44:01.624503326 +0100
+++ gcc/testsuite/gcc.dg/pr63637-4.c	2015-01-13 13:44:44.220749376 +0100
@@ -0,0 +1,15 @@ 
+/* PR rtl-optimization/63637 */
+/* { dg-do compile } */
+/* { dg-options "-O2" } */
+
+int
+foo (void)
+{
+  int a, b, c;
+  asm ("# Magic instruction" : "=r" (a) : "r" (0) : "memory");
+  asm ("# Magic instruction" : "=r" (b) : "r" (0) : "memory");
+  asm ("# Magic instruction" : "=r" (c) : "r" (0) : "memory");
+  return a + b + c;
+}
+
+/* { dg-final { scan-assembler-times "# Magic instruction" 3 } } */
--- gcc/testsuite/gcc.dg/pr63637-5.c.jj	2015-01-13 13:46:36.837756064 +0100
+++ gcc/testsuite/gcc.dg/pr63637-5.c	2015-01-13 13:47:01.461320229 +0100
@@ -0,0 +1,15 @@ 
+/* PR rtl-optimization/63637 */
+/* { dg-do compile } */
+/* { dg-options "-O2" } */
+
+int
+foo (void)
+{
+  int a, b, c, d, e, f;
+  asm ("# Magic instruction" : "=r" (a), "=r" (d));
+  asm ("# Magic instruction" : "=r" (b), "=r" (e));
+  asm ("# Magic instruction" : "=r" (c), "=r" (f));
+  return a + b + c;
+}
+
+/* { dg-final { scan-assembler-times "# Magic instruction" 3 } } */
--- gcc/testsuite/gcc.dg/pr63637-6.c.jj	2015-01-13 13:46:39.834703018 +0100
+++ gcc/testsuite/gcc.dg/pr63637-6.c	2015-01-13 13:47:27.915851986 +0100
@@ -0,0 +1,15 @@ 
+/* PR rtl-optimization/63637 */
+/* { dg-do compile } */
+/* { dg-options "-O2" } */
+
+int
+foo (void)
+{
+  int a, b, c, d, e, f;
+  asm ("# Magic instruction" : "=r" (a), "=r" (d) : "r" (0));
+  asm ("# Magic instruction" : "=r" (b), "=r" (e) : "r" (0));
+  asm ("# Magic instruction" : "=r" (c), "=r" (f) : "r" (0));
+  return a + b + c;
+}
+
+/* { dg-final { scan-assembler-times "# Magic instruction" 3 } } */
--- gcc/testsuite/gcc.target/i386/pr63637-1.c.jj	2015-01-13 13:40:13.996531691 +0100
+++ gcc/testsuite/gcc.target/i386/pr63637-1.c	2015-01-13 13:42:37.945984430 +0100
@@ -0,0 +1,15 @@ 
+/* PR rtl-optimization/63637 */
+/* { dg-do compile } */
+/* { dg-options "-O2" } */
+
+int
+foo (void)
+{
+  int a, b, c;
+  asm ("# Magic instruction" : "=r" (a) : : "eax");
+  asm ("# Magic instruction" : "=r" (b) : : "edx");
+  asm ("# Magic instruction" : "=r" (c) : : "ecx");
+  return a + b + c;
+}
+
+/* { dg-final { scan-assembler-times "# Magic instruction" 1 } } */
--- gcc/testsuite/gcc.target/i386/pr63637-2.c.jj	2015-01-13 13:42:12.557433805 +0100
+++ gcc/testsuite/gcc.target/i386/pr63637-2.c	2015-01-13 13:42:30.656113460 +0100
@@ -0,0 +1,15 @@ 
+/* PR rtl-optimization/63637 */
+/* { dg-do compile } */
+/* { dg-options "-O2" } */
+
+int
+foo (void)
+{
+  int a, b, c;
+  asm ("# Magic instruction" : "=r" (a) : "r" (0) : "eax");
+  asm ("# Magic instruction" : "=r" (b) : "r" (0) : "edx");
+  asm ("# Magic instruction" : "=r" (c) : "r" (0) : "ecx");
+  return a + b + c;
+}
+
+/* { dg-final { scan-assembler-times "# Magic instruction" 1 } } */
--- gcc/testsuite/gcc.target/i386/pr63637-3.c.jj	2015-01-13 13:43:06.407480663 +0100
+++ gcc/testsuite/gcc.target/i386/pr63637-3.c	2015-01-13 13:43:28.600087856 +0100
@@ -0,0 +1,15 @@ 
+/* PR rtl-optimization/63637 */
+/* { dg-do compile } */
+/* { dg-options "-O2" } */
+
+int
+foo (void)
+{
+  int a, b, c;
+  asm ("# Magic instruction" : "=r" (a) : : "eax", "memory");
+  asm ("# Magic instruction" : "=r" (b) : : "edx", "memory");
+  asm ("# Magic instruction" : "=r" (c) : : "ecx", "memory");
+  return a + b + c;
+}
+
+/* { dg-final { scan-assembler-times "# Magic instruction" 3 } } */
--- gcc/testsuite/gcc.target/i386/pr63637-4.c.jj	2015-01-13 13:43:09.505425830 +0100
+++ gcc/testsuite/gcc.target/i386/pr63637-4.c	2015-01-13 13:43:44.769801653 +0100
@@ -0,0 +1,15 @@ 
+/* PR rtl-optimization/63637 */
+/* { dg-do compile } */
+/* { dg-options "-O2" } */
+
+int
+foo (void)
+{
+  int a, b, c;
+  asm ("# Magic instruction" : "=r" (a) : "r" (0) : "eax", "memory");
+  asm ("# Magic instruction" : "=r" (b) : "r" (0) : "edx", "memory");
+  asm ("# Magic instruction" : "=r" (c) : "r" (0) : "ecx", "memory");
+  return a + b + c;
+}
+
+/* { dg-final { scan-assembler-times "# Magic instruction" 3 } } */
--- gcc/testsuite/gcc.target/i386/pr63637-5.c.jj	2015-01-13 13:45:38.747784252 +0100
+++ gcc/testsuite/gcc.target/i386/pr63637-5.c	2015-01-13 13:45:34.350862077 +0100
@@ -0,0 +1,15 @@ 
+/* PR rtl-optimization/63637 */
+/* { dg-do compile } */
+/* { dg-options "-O2" } */
+
+int
+foo (void)
+{
+  int a, b, c, d, e, f;
+  asm ("# Magic instruction" : "=r" (a), "=r" (d) : : "eax");
+  asm ("# Magic instruction" : "=r" (b), "=r" (e) : : "edx");
+  asm ("# Magic instruction" : "=r" (c), "=r" (f) : : "ecx");
+  return a + b + c;
+}
+
+/* { dg-final { scan-assembler-times "# Magic instruction" 3 } } */
--- gcc/testsuite/gcc.target/i386/pr63637-6.c.jj	2015-01-13 13:45:54.923497943 +0100
+++ gcc/testsuite/gcc.target/i386/pr63637-6.c	2015-01-13 13:46:23.965983893 +0100
@@ -0,0 +1,15 @@ 
+/* PR rtl-optimization/63637 */
+/* { dg-do compile } */
+/* { dg-options "-O2" } */
+
+int
+foo (void)
+{
+  int a, b, c, d, e, f;
+  asm ("# Magic instruction" : "=r" (a), "=r" (d) : "r" (0) : "eax");
+  asm ("# Magic instruction" : "=r" (b), "=r" (e) : "r" (0) : "edx");
+  asm ("# Magic instruction" : "=r" (c), "=r" (f) : "r" (0) : "ecx");
+  return a + b + c;
+}
+
+/* { dg-final { scan-assembler-times "# Magic instruction" 3 } } */