diff mbox series

fold, simplify-rtx: Punt on non-representable floating point constants [PR104522]

Message ID 20220215095804.GL2646553@tucnak
State New
Headers show
Series fold, simplify-rtx: Punt on non-representable floating point constants [PR104522] | expand

Commit Message

Jakub Jelinek Feb. 15, 2022, 9:58 a.m. UTC
Hi!

For IBM double double I've added in PR95450 and PR99648 verification that
when we interpret target bytes as a REAL_CST or CONST_DOUBLE constant at
the tree/GIMPLE or RTL level, we encode the constant back to target bytes
and verify that it round-trips to the same bytes.
This is because our real.c support isn't able to represent all valid values
of IBM double double, which has variable precision.
In PR104522, it has been noted that we have a similar problem with the
Intel/Motorola extended XFmode formats: our internal representation isn't
able to record pseudo denormals, pseudo infinities, pseudo NaNs and unnormal
values.
So, the following patch is an attempt to extend that verification to all
floats.
Unfortunately, it wasn't that straightforward, because precisely for the
XFmode long doubles the __builtin_clear_padding code needs to discover
which bits are padding, and it does that by interpreting memory filled
with all-ones bytes.  That is actually a valid supported value, a qNaN
with negative sign and all mantissa bits set, but the verification also
covers the padding bits (exactly what __builtin_clear_padding wants to
figure out), so the comparison check fails and we ICE.
The patch fixes that case by moving the verification from
native_interpret_real to its caller, so that clear_padding_type can
call native_interpret_real directly and avoid the extra check.
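
(As an illustration of why the all-ones bytes are a valid value, here is a
small host-side sketch, not part of the patch; it assumes an x86-64 host
where long double is the 80-bit extended format stored in 16 bytes:)

#include <stdio.h>
#include <string.h>

int
main (void)
{
  /* Fill the whole 16-byte object, value bits and padding alike.  */
  union { long double ld; unsigned char b[sizeof (long double)]; } u;
  memset (&u, 0xff, sizeof u);
  /* Bytes 0..7 are the significand (explicit integer bit set), bytes 8..9
     hold the sign and the 0x7fff exponent, bytes 10..15 are padding the
     FPU ignores: the value bits alone form a negative qNaN.  */
  printf ("%Lg\n", u.ld);	/* prints -nan on glibc */
  return 0;
}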

With this, the only thing that regresses in the testsuite is
+FAIL: gcc.target/i386/auto-init-4.c scan-assembler-times long\\t-16843010 5
because pattern init picks a value that has non-zero bits in the padding
bits of the long double, so the simplify-rtx.cc change prevents folding
a SUBREG into a constant.  We then emit (the testcase is -O0, but we emit
worse code at all optimization levels) something like:
        movabsq $-72340172838076674, %rax
        movabsq $-72340172838076674, %rdx
        movq    %rax, -48(%rbp)
        movq    %rdx, -40(%rbp)
        fldt    -48(%rbp)
        fstpt   -32(%rbp)
instead of
        fldt    .LC2(%rip)
        fstpt   -32(%rbp)
...
.LC2:
        .long   -16843010
        .long   -16843010
        .long   65278
        .long   0
Note that neither of those sequences actually stores the padding bits;
fstpt simply doesn't touch them.
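
(As an aside, these immediates are just the -ftrivial-auto-var-init=pattern
fill byte 0xfe repeated; a quick host-side check of the decimal values, not
part of the patch:)

#include <stdio.h>
#include <stdint.h>

int
main (void)
{
  /* Two's complement assumed (true for all GCC targets).  */
  printf ("%lld\n", (long long) (int64_t) 0xfefefefefefefefeULL); /* -72340172838076674 */
  printf ("%d\n", (int) (int32_t) 0xfefefefeUL);                  /* -16843010 */
  printf ("%d\n", 0xfefe);                                        /* 65278 */
  return 0;
}
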
For vars with clear_padding_real_needs_padding_p types that are allocated
to memory at expansion time, I'd say it would be much better to do the
stores using integral modes rather than XFmode, i.e. emit:
        movabsq $-72340172838076674, %rax
	movq	%rax, -32(%rbp)
	movq	%rax, -24(%rbp)
directly.  That is the only way to ensure the padding bits are initialized
(or expand __builtin_clear_padding, but then the value bits and the padding
bits are initialized separately).
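
(For reference, the source-level shape of that parenthesized alternative,
first initializing the value bits and then clearing the padding separately,
would be roughly the following sketch; __builtin_clear_padding is the
existing builtin:)

void
init_ld (long double *p)
{
  *p = 0.0L;			/* initialize the 80 value bits */
  __builtin_clear_padding (p);	/* zero the remaining XFmode padding bytes */
}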

Bootstrapped/regtested on x86_64-linux and i686-linux, though as mentioned
above, the gcc.target/i386/auto-init-4.c case is unresolved.

2022-02-15  Jakub Jelinek  <jakub@redhat.com>

	PR middle-end/104522
	* fold-const.h (native_interpret_real): Declare.
	* fold-const.cc (native_interpret_real): No longer static.  Don't
	perform MODE_COMPOSITE_P verification here.
	(native_interpret_expr) <case REAL_TYPE>: But perform it here instead
	for all modes.
	* gimple-fold.cc (clear_padding_type): Call native_interpret_real
	instead of native_interpret_expr.
	* simplify-rtx.cc (simplify_immed_subreg): Perform the native_encode_rtx
	and comparison verification for all FLOAT_MODE_P modes, not just
	MODE_COMPOSITE_P.

	* gcc.dg/pr104522.c: New test.


	Jakub

Comments

Richard Biener Feb. 15, 2022, 10:55 a.m. UTC | #1
On Tue, 15 Feb 2022, Jakub Jelinek wrote:

> Hi!
> 
> For IBM double double I've added in PR95450 and PR99648 verification that
> when we at the tree/GIMPLE or RTL level interpret target bytes as a REAL_CST
> or CONST_DOUBLE constant, we try to encode it back to target bytes and
> verify it is the same.
> This is because our real.c support isn't able to represent all valid values
> of IBM double double which has variable precision.
> In PR104522, it has been noted that we have similar problem with the
> Intel/Motorola extended XFmode formats, our internal representation isn't
> able to record pseudo denormals, pseudo infinities, pseudo NaNs and unnormal
> values.
> So, the following patch is an attempt to extend that verification to all
> floats.
> Unfortunately, it wasn't that straightforward, because the
> __builtin_clear_padding code exactly for the XFmode long doubles needs to
> discover what bits are padding and does that by interpreting memory of
> all 1s.  That is actually a valid supported value, a qNaN with negative
> sign with all mantissa bits set, but the verification includes also the
> padding bits (exactly what __builtin_clear_padding wants to figure out)
> and so fails the comparison check and so we ICE.
> The patch fixes that case by moving that verification from
> native_interpret_real to its caller, so that clear_padding_type can
> call native_interpret_real and avoid that extra check.
> 
> With this, the only thing that regresses in the testsuite is
> +FAIL: gcc.target/i386/auto-init-4.c scan-assembler-times long\\t-16843010 5
> because it decides to use a pattern that has non-zero bits in the padding
> bits of the long double, so the simplify-rtx.cc change prevents folding
> a SUBREG into a constant.  We emit (the testcase is -O0 but we emit worse
> code at all opt levels) something like:
>         movabsq $-72340172838076674, %rax
>         movabsq $-72340172838076674, %rdx
>         movq    %rax, -48(%rbp)
>         movq    %rdx, -40(%rbp)
>         fldt    -48(%rbp)
>         fstpt   -32(%rbp)
> instead of
>         fldt    .LC2(%rip)
>         fstpt   -32(%rbp)
> ...
> .LC2:
>         .long   -16843010
>         .long   -16843010
>         .long   65278
>         .long   0
> Note, neither of those sequences actually stores the padding bits, fstpt
> simply doesn't touch them.
> For vars with clear_padding_real_needs_padding_p types that are allocated
> to memory at expansion time, I'd say much better would be to do the stores
> using integral modes rather than XFmode, so do that:
>         movabsq $-72340172838076674, %rax
> 	movq	%rax, -32(%rbp)
> 	movq	%rax, -24(%rbp)
> directly.  That is the only way to ensure the padding bits are initialized
> (or expand __builtin_clear_padding, but then you initialize separately the
> value bits and padding bits).
> 
> Bootstrapped/regtested on x86_64-linux and i686-linux, though as mentioned
> above, the gcc.target/i386/auto-init-4.c case is unresolved.

OK.

Thanks,
Richard.

> 2022-02-15  Jakub Jelinek  <jakub@redhat.com>
> 
> 	PR middle-end/104522
> 	* fold-const.h (native_interpret_real): Declare.
> 	* fold-const.cc (native_interpret_real): No longer static.  Don't
> 	perform MODE_COMPOSITE_P verification here.
> 	(native_interpret_expr) <case REAL_TYPE>: But perform it here instead
> 	for all modes.
> 	* gimple-fold.cc (clear_padding_type): Call native_interpret_real
> 	instead of native_interpret_expr.
> 	* simplify-rtx.cc (simplify_immed_subreg): Perform the native_encode_rtx
> 	and comparison verification for all FLOAT_MODE_P modes, not just
> 	MODE_COMPOSITE_P.
> 
> 	* gcc.dg/pr104522.c: New test.
> 
> --- gcc/fold-const.h.jj	2022-02-07 21:26:50.717616208 +0100
> +++ gcc/fold-const.h	2022-02-15 01:16:14.509617954 +0100
> @@ -36,6 +36,7 @@ extern int native_encode_expr (const_tre
>  extern int native_encode_initializer (tree, unsigned char *, int,
>  				      int off = -1, unsigned char * = nullptr);
>  extern tree native_interpret_expr (tree, const unsigned char *, int);
> +extern tree native_interpret_real (tree, const unsigned char *, int);
>  extern bool can_native_interpret_type_p (tree);
>  extern tree native_interpret_aggregate (tree, const unsigned char *, int, int);
>  extern tree find_bitfield_repr_type (int, int);
> --- gcc/fold-const.cc.jj	2022-02-09 22:15:31.805466094 +0100
> +++ gcc/fold-const.cc	2022-02-15 01:36:11.988496438 +0100
> @@ -8643,7 +8643,7 @@ native_interpret_fixed (tree type, const
>     the buffer PTR of length LEN as a REAL_CST of type TYPE.
>     If the buffer cannot be interpreted, return NULL_TREE.  */
>  
> -static tree
> +tree
>  native_interpret_real (tree type, const unsigned char *ptr, int len)
>  {
>    scalar_float_mode mode = SCALAR_FLOAT_TYPE_MODE (type);
> @@ -8694,19 +8694,7 @@ native_interpret_real (tree type, const
>      }
>  
>    real_from_target (&r, tmp, mode);
> -  tree ret = build_real (type, r);
> -  if (MODE_COMPOSITE_P (mode))
> -    {
> -      /* For floating point values in composite modes, punt if this folding
> -	 doesn't preserve bit representation.  As the mode doesn't have fixed
> -	 precision while GCC pretends it does, there could be valid values that
> -	 GCC can't really represent accurately.  See PR95450.  */
> -      unsigned char buf[24];
> -      if (native_encode_expr (ret, buf, total_bytes, 0) != total_bytes
> -	  || memcmp (ptr, buf, total_bytes) != 0)
> -	ret = NULL_TREE;
> -    }
> -  return ret;
> +  return build_real (type, r);
>  }
>  
>  
> @@ -8824,7 +8812,23 @@ native_interpret_expr (tree type, const
>        return native_interpret_int (type, ptr, len);
>  
>      case REAL_TYPE:
> -      return native_interpret_real (type, ptr, len);
> +      if (tree ret = native_interpret_real (type, ptr, len))
> +	{
> +	  /* For floating point values in composite modes, punt if this
> +	     folding doesn't preserve bit representation.  As the mode doesn't
> +	     have fixed precision while GCC pretends it does, there could be
> +	     valid values that GCC can't really represent accurately.
> +	     See PR95450.  Even for other modes, e.g. x86 XFmode can have some
> +	     bit combinations which GCC doesn't preserve.  */
> +	  unsigned char buf[24];
> +	  scalar_float_mode mode = SCALAR_FLOAT_TYPE_MODE (type);
> +	  int total_bytes = GET_MODE_SIZE (mode);
> +	  if (native_encode_expr (ret, buf, total_bytes, 0) != total_bytes
> +	      || memcmp (ptr, buf, total_bytes) != 0)
> +	    return NULL_TREE;
> +	  return ret;
> +	}
> +      return NULL_TREE;
>  
>      case FIXED_POINT_TYPE:
>        return native_interpret_fixed (type, ptr, len);
> --- gcc/gimple-fold.cc.jj	2022-02-11 21:59:57.177698154 +0100
> +++ gcc/gimple-fold.cc	2022-02-15 01:21:08.706649224 +0100
> @@ -4807,10 +4807,10 @@ clear_padding_type (clear_padding_struct
>  	clear_padding_flush (buf, false);
>        if (clear_padding_real_needs_padding_p (type))
>  	{
> -	  /* Use native_interpret_expr + native_encode_expr to figure out
> +	  /* Use native_interpret_real + native_encode_expr to figure out
>  	     which bits are padding.  */
>  	  memset (buf->buf + buf->size, ~0, sz);
> -	  tree cst = native_interpret_expr (type, buf->buf + buf->size, sz);
> +	  tree cst = native_interpret_real (type, buf->buf + buf->size, sz);
>  	  gcc_assert (cst && TREE_CODE (cst) == REAL_CST);
>  	  int len = native_encode_expr (cst, buf->buf + buf->size, sz);
>  	  gcc_assert (len > 0 && (size_t) len == (size_t) sz);
> --- gcc/simplify-rtx.cc.jj	2022-01-20 21:24:40.611955521 +0100
> +++ gcc/simplify-rtx.cc	2022-02-14 22:14:29.258718212 +0100
> @@ -7302,7 +7302,7 @@ simplify_immed_subreg (fixed_size_mode o
>    else if (!native_encode_rtx (innermode, x, buffer, first_byte, inner_bytes))
>      return NULL_RTX;
>    rtx ret = native_decode_rtx (outermode, buffer, 0);
> -  if (ret && MODE_COMPOSITE_P (outermode))
> +  if (ret && FLOAT_MODE_P (outermode))
>      {
>        auto_vec<target_unit, 128> buffer2 (buffer_bytes);
>        if (!native_encode_rtx (outermode, ret, buffer2, 0, buffer_bytes))
> --- gcc/testsuite/gcc.dg/pr104522.c.jj	2022-02-14 22:14:29.258718212 +0100
> +++ gcc/testsuite/gcc.dg/pr104522.c	2022-02-14 22:14:29.258718212 +0100
> @@ -0,0 +1,14 @@
> +/* PR middle-end/104522 */
> +/* { dg-do compile } */
> +/* { dg-options "-O -fcompare-debug -dP" } */
> +
> +typedef short __attribute__((__vector_size__(16))) V;
> +long double x;
> +
> +void
> +foo (void)
> +{
> +  V t = { 512, 0, 0, 0, 16384 };
> +  long double u = *(long double *) &t;
> +  x /= u;
> +}
> 
> 	Jakub
> 
>
Qing Zhao Feb. 15, 2022, 4:30 p.m. UTC | #2
> On Feb 15, 2022, at 3:58 AM, Jakub Jelinek <jakub@redhat.com> wrote:
> 
> Hi!
> 
> For IBM double double I've added in PR95450 and PR99648 verification that
> when we at the tree/GIMPLE or RTL level interpret target bytes as a REAL_CST
> or CONST_DOUBLE constant, we try to encode it back to target bytes and
> verify it is the same.
> This is because our real.c support isn't able to represent all valid values
> of IBM double double which has variable precision.
> In PR104522, it has been noted that we have similar problem with the
> Intel/Motorola extended XFmode formats, our internal representation isn't
> able to record pseudo denormals, pseudo infinities, pseudo NaNs and unnormal
> values.
> So, the following patch is an attempt to extend that verification to all
> floats.
> Unfortunately, it wasn't that straightforward, because the
> __builtin_clear_padding code exactly for the XFmode long doubles needs to
> discover what bits are padding and does that by interpreting memory of
> all 1s.  That is actually a valid supported value, a qNaN with negative
> sign with all mantissa bits set, but the verification includes also the
> padding bits (exactly what __builtin_clear_padding wants to figure out)
> and so fails the comparison check and so we ICE.
> The patch fixes that case by moving that verification from
> native_interpret_real to its caller, so that clear_padding_type can
> call native_interpret_real and avoid that extra check.
> 
> With this, the only thing that regresses in the testsuite is
> +FAIL: gcc.target/i386/auto-init-4.c scan-assembler-times long\\t-16843010 5
> because it decides to use a pattern that has non-zero bits in the padding
> bits of the long double, so the simplify-rtx.cc change prevents folding
> a SUBREG into a constant.  We emit (the testcase is -O0 but we emit worse
> code at all opt levels) something like:
>        movabsq $-72340172838076674, %rax
>        movabsq $-72340172838076674, %rdx
>        movq    %rax, -48(%rbp)
>        movq    %rdx, -40(%rbp)
>        fldt    -48(%rbp)
>        fstpt   -32(%rbp)
> instead of
>        fldt    .LC2(%rip)
>        fstpt   -32(%rbp)
> ...
> .LC2:
>        .long   -16843010
>        .long   -16843010
>        .long   65278
>        .long   0
> Note, neither of those sequences actually stores the padding bits, fstpt
> simply doesn't touch them.
> For vars with clear_padding_real_needs_padding_p types that are allocated
> to memory at expansion time, I'd say much better would be to do the stores
> using integral modes rather than XFmode, so do that:
>        movabsq $-72340172838076674, %rax
> 	movq	%rax, -32(%rbp)
> 	movq	%rax, -24(%rbp)
> directly.  That is the only way to ensure the padding bits are initialized
> (or expand __builtin_clear_padding, but then you initialize separately the
> value bits and padding bits).
> 
> Bootstrapped/regtested on x86_64-linux and i686-linux, though as mentioned
> above, the gcc.target/i386/auto-init-4.c case is unresolved.

Thanks, I will try to fix this test case in a later patch.

Qing
>
Richard Biener April 13, 2022, 8:41 a.m. UTC | #3
On Tue, Feb 15, 2022 at 5:31 PM Qing Zhao via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
>
>
> > On Feb 15, 2022, at 3:58 AM, Jakub Jelinek <jakub@redhat.com> wrote:
> >
> > Hi!
> >
> > For IBM double double I've added in PR95450 and PR99648 verification that
> > when we at the tree/GIMPLE or RTL level interpret target bytes as a REAL_CST
> > or CONST_DOUBLE constant, we try to encode it back to target bytes and
> > verify it is the same.
> > This is because our real.c support isn't able to represent all valid values
> > of IBM double double which has variable precision.
> > In PR104522, it has been noted that we have similar problem with the
> > Intel/Motorola extended XFmode formats, our internal representation isn't
> > able to record pseudo denormals, pseudo infinities, pseudo NaNs and unnormal
> > values.
> > So, the following patch is an attempt to extend that verification to all
> > floats.
> > Unfortunately, it wasn't that straightforward, because the
> > __builtin_clear_padding code exactly for the XFmode long doubles needs to
> > discover what bits are padding and does that by interpreting memory of
> > all 1s.  That is actually a valid supported value, a qNaN with negative
> > sign with all mantissa bits set, but the verification includes also the
> > padding bits (exactly what __builtin_clear_padding wants to figure out)
> > and so fails the comparison check and so we ICE.
> > The patch fixes that case by moving that verification from
> > native_interpret_real to its caller, so that clear_padding_type can
> > call native_interpret_real and avoid that extra check.
> >
> > With this, the only thing that regresses in the testsuite is
> > +FAIL: gcc.target/i386/auto-init-4.c scan-assembler-times long\\t-16843010 5
> > because it decides to use a pattern that has non-zero bits in the padding
> > bits of the long double, so the simplify-rtx.cc change prevents folding
> > a SUBREG into a constant.  We emit (the testcase is -O0 but we emit worse
> > code at all opt levels) something like:
> >        movabsq $-72340172838076674, %rax
> >        movabsq $-72340172838076674, %rdx
> >        movq    %rax, -48(%rbp)
> >        movq    %rdx, -40(%rbp)
> >        fldt    -48(%rbp)
> >        fstpt   -32(%rbp)
> > instead of
> >        fldt    .LC2(%rip)
> >        fstpt   -32(%rbp)
> > ...
> > .LC2:
> >        .long   -16843010
> >        .long   -16843010
> >        .long   65278
> >        .long   0
> > Note, neither of those sequences actually stores the padding bits, fstpt
> > simply doesn't touch them.
> > For vars with clear_padding_real_needs_padding_p types that are allocated
> > to memory at expansion time, I'd say much better would be to do the stores
> > using integral modes rather than XFmode, so do that:
> >        movabsq $-72340172838076674, %rax
> >       movq    %rax, -32(%rbp)
> >       movq    %rax, -24(%rbp)
> > directly.  That is the only way to ensure the padding bits are initialized
> > (or expand __builtin_clear_padding, but then you initialize separately the
> > value bits and padding bits).
> >
> > Bootstrapped/regtested on x86_64-linux and i686-linux, though as mentioned
> > above, the gcc.target/i386/auto-init-4.c case is unresolved.
>
> Thanks, I will try to fix this testing case in a later patch.

I've looked at this FAIL now and really wonder whether "pattern init" as
implemented makes any sense for non-integral types.  We end up
initializing a register (SSA name) with

  VIEW_CONVERT_EXPR<long double>(0xfefefefefefefefefefefefefefefefe)

as we go building a TImode constant (we verified we have a TImode SET!)
but then

          /* Pun the LHS to make sure its type has constant size
             unless it is an SSA name where that's already known.  */
          if (TREE_CODE (lhs) != SSA_NAME)
            lhs = build1 (VIEW_CONVERT_EXPR, itype, lhs);
          else
            init = fold_build1 (VIEW_CONVERT_EXPR, TREE_TYPE (lhs), init);
...
      expand_assignment (lhs, init, false);

and generally registers do not have any padding.  This weird expansion
then causes us to spill the TImode constant and reload the XFmode value,
which is definitely not optimal here.
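
For concreteness, the register case described above can be seen with
something like the following, compiled with -ftrivial-auto-var-init=pattern
on x86-64 (a minimal sketch in the spirit of gcc.target/i386/auto-init-4.c,
not the exact testcase):

/* 'x' is a GIMPLE register here (not address-taken), so its pattern init
   goes through the SSA-name path quoted above: a 0xfe-filled TImode
   constant view-converted to long double.  */
long double
foo (int flag)
{
  long double x;
  if (flag)
    x = 1.0L;
  return x;
}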

One approach to avoid the worse code generation would be to use mode
specific patterns for registers (like using a NaN or a target specific
value that can be loaded cheaply); another would be to simply fall back
to zero initialization when we fail to constant fold the initializer,
like with
diff --git a/gcc/internal-fn.cc b/gcc/internal-fn.cc
index 8b1733e20c4..a4b995f71e4 100644
--- a/gcc/internal-fn.cc
+++ b/gcc/internal-fn.cc
@@ -3125,7 +3125,11 @@ expand_DEFERRED_INIT (internal_fn, gcall *stmt)
          if (TREE_CODE (lhs) != SSA_NAME)
            lhs = build1 (VIEW_CONVERT_EXPR, itype, lhs);
          else
-           init = fold_build1 (VIEW_CONVERT_EXPR, TREE_TYPE (lhs), init);
+           {
+             init = fold_build1 (VIEW_CONVERT_EXPR, TREE_TYPE (lhs), init);
+             if (!CONSTANT_CLASS_P (init))
+               init = build_zero_cst (TREE_TYPE (lhs));
+           }
        }
       else
        /* Use zero-init also for variable-length sizes.  */

Note that this still does not address the issue Jakub noted, that we do
not initialize padding this way; but of course that's because we expand a
special assignment from .DEFERRED_INIT and are not initializing
the storage allocated for the variable on the stack (at -O0) by RTL
expansion.  Ideally .DEFERRED_INIT "expansion" for stack variables
would take place there, simply filling the allocated frame with the
pattern or zero.  That would probably mean that RTL expansion
should scoop up .DEFERRED_INITs for variables it decides not to
expand to pseudos rather than leaving that to stmt expansion.

I'm going to simply XFAIL this testcase for GCC 12 and have opened
PR105259 so we can revisit the above for GCC 13.

Richard.

> Qing
> >
>
Qing Zhao April 13, 2022, 3:22 p.m. UTC | #4
Hi, Richard,

Thanks a lot for taking a look at this issue (and sorry that I haven’t fixed this one yet; I was distracted by other tasks and then just forgot about it).

> On Apr 13, 2022, at 3:41 AM, Richard Biener <richard.guenther@gmail.com> wrote:
> 
> On Tue, Feb 15, 2022 at 5:31 PM Qing Zhao via Gcc-patches
> <gcc-patches@gcc.gnu.org> wrote:
>> 
>> 
>> 
>>> On Feb 15, 2022, at 3:58 AM, Jakub Jelinek <jakub@redhat.com> wrote:
>>> 
>>> Hi!
>>> 
>>> For IBM double double I've added in PR95450 and PR99648 verification that
>>> when we at the tree/GIMPLE or RTL level interpret target bytes as a REAL_CST
>>> or CONST_DOUBLE constant, we try to encode it back to target bytes and
>>> verify it is the same.
>>> This is because our real.c support isn't able to represent all valid values
>>> of IBM double double which has variable precision.
>>> In PR104522, it has been noted that we have similar problem with the
>>> Intel/Motorola extended XFmode formats, our internal representation isn't
>>> able to record pseudo denormals, pseudo infinities, pseudo NaNs and unnormal
>>> values.
>>> So, the following patch is an attempt to extend that verification to all
>>> floats.
>>> Unfortunately, it wasn't that straightforward, because the
>>> __builtin_clear_padding code exactly for the XFmode long doubles needs to
>>> discover what bits are padding and does that by interpreting memory of
>>> all 1s.  That is actually a valid supported value, a qNaN with negative
>>> sign with all mantissa bits set, but the verification includes also the
>>> padding bits (exactly what __builtin_clear_padding wants to figure out)
>>> and so fails the comparison check and so we ICE.
>>> The patch fixes that case by moving that verification from
>>> native_interpret_real to its caller, so that clear_padding_type can
>>> call native_interpret_real and avoid that extra check.
>>> 
>>> With this, the only thing that regresses in the testsuite is
>>> +FAIL: gcc.target/i386/auto-init-4.c scan-assembler-times long\\t-16843010 5
>>> because it decides to use a pattern that has non-zero bits in the padding
>>> bits of the long double, so the simplify-rtx.cc change prevents folding
>>> a SUBREG into a constant.  We emit (the testcase is -O0 but we emit worse
>>> code at all opt levels) something like:
>>>       movabsq $-72340172838076674, %rax
>>>       movabsq $-72340172838076674, %rdx
>>>       movq    %rax, -48(%rbp)
>>>       movq    %rdx, -40(%rbp)
>>>       fldt    -48(%rbp)
>>>       fstpt   -32(%rbp)
>>> instead of
>>>       fldt    .LC2(%rip)
>>>       fstpt   -32(%rbp)
>>> ...
>>> .LC2:
>>>       .long   -16843010
>>>       .long   -16843010
>>>       .long   65278
>>>       .long   0
>>> Note, neither of those sequences actually stores the padding bits, fstpt
>>> simply doesn't touch them.
>>> For vars with clear_padding_real_needs_padding_p types that are allocated
>>> to memory at expansion time, I'd say much better would be to do the stores
>>> using integral modes rather than XFmode, so do that:
>>>       movabsq $-72340172838076674, %rax
>>>      movq    %rax, -32(%rbp)
>>>      movq    %rax, -24(%rbp)
>>> directly.  That is the only way to ensure the padding bits are initialized
>>> (or expand __builtin_clear_padding, but then you initialize separately the
>>> value bits and padding bits).
>>> 
>>> Bootstrapped/regtested on x86_64-linux and i686-linux, though as mentioned
>>> above, the gcc.target/i386/auto-init-4.c case is unresolved.
>> 
>> Thanks, I will try to fix this testing case in a later patch.
> 
> I've looked at this FAIL now and really wonder whether "pattern init" as
> implemented makes any sense for non-integral types.  
> We end up with
> initializing a register (SSA name) with
> 
>  VIEW_CONVERT_EXPR<long double>(0xfefefefefefefefefefefefefefefefe)
> 
> as we go building a TImode constant (we verified we have a TImode SET!)
> but then
> 
>          /* Pun the LHS to make sure its type has constant size
>             unless it is an SSA name where that's already known.  */
>          if (TREE_CODE (lhs) != SSA_NAME)
>            lhs = build1 (VIEW_CONVERT_EXPR, itype, lhs);
>          else
>            init = fold_build1 (VIEW_CONVERT_EXPR, TREE_TYPE (lhs), init);
> ...
>      expand_assignment (lhs, init, false);
> 
> and generally registers do not have any padding.  This weird expansion
> then causes us to spill the TImode constant and reload the XFmode value,
> which is definitely not optimal here.
> 
> One approach to avoid the worse code generation would be to use mode
> specific patterns for registers (like using a NaN or a target specific
> value that
> can be loaded cheaply),

You mean using “mode specific patterns” ONLY for registers?
Can we use “mode specific patterns” consistently for both registers and memory?
LLVM uses different patterns for different types (integral, float, pointer, etc.) in order to
catch bugs for the different types more easily.

The early versions of this patch tried to do something similar to LLVM, but later we gave that up
and used the same pattern for all types.

If we now want to use “mode specific patterns” for registers, I am wondering whether it would be
better to just use “mode specific patterns” consistently for both registers and memory?

> another would be to simply fall back to zero
> initialization
> when we fail to constant fold the initializer like with
> 
> diff --git a/gcc/internal-fn.cc b/gcc/internal-fn.cc
> index 8b1733e20c4..a4b995f71e4 100644
> --- a/gcc/internal-fn.cc
> +++ b/gcc/internal-fn.cc
> @@ -3125,7 +3125,11 @@ expand_DEFERRED_INIT (internal_fn, gcall *stmt)
>          if (TREE_CODE (lhs) != SSA_NAME)
>            lhs = build1 (VIEW_CONVERT_EXPR, itype, lhs);
>          else
> -           init = fold_build1 (VIEW_CONVERT_EXPR, TREE_TYPE (lhs), init);
> +           {
> +             init = fold_build1 (VIEW_CONVERT_EXPR, TREE_TYPE (lhs), init);
> +             if (!CONSTANT_CLASS_P (init))
> +               init = build_zero_cst (TREE_TYPE (lhs));
> +           }
>        }
>       else
>        /* Use zero-init also for variable-length sizes.  */
> 
> note that still does not address the issue noted by Jakub that we do not
> initialize padding this way - but of course that's because we expand a
> special assignment from .DEFERRED_INIT and are not initializing
> the storage allocated for the variable on the stack (at -O0) by RTL
> expansion.  Ideally .DEFERRED_INIT "expansion" for stack variables
> would take place there, simply filling the allocated frame with the
> pattern or zero.  That would probably mean that RTL expansion
> should scoop up .DEFERRED_INITs for variables it decides not to
> expand to pseudos and not leave that to stmt expansion.

Need to study this a little bit more...

> 
> I'm going to simply XFAIL this testcase for GCC 12 and opened
> PR105259 so we can revisit the above for GCC 13.

I can further study this bug and try to come up with a good solution in gcc13.

Thank you!

Qing
> 
> Richard.
> 
>> Qing
>>> 
>>
Richard Biener April 14, 2022, 6:53 a.m. UTC | #5
On Wed, Apr 13, 2022 at 5:22 PM Qing Zhao <qing.zhao@oracle.com> wrote:
>
> Hi, Richard,
>
> Thanks a lot for taking a look at this issue (and Sorry that I haven’t fixed this one yet, I was distracted by other tasks then just forgot this one….)
>
> > On Apr 13, 2022, at 3:41 AM, Richard Biener <richard.guenther@gmail.com> wrote:
> >
> > On Tue, Feb 15, 2022 at 5:31 PM Qing Zhao via Gcc-patches
> > <gcc-patches@gcc.gnu.org> wrote:
> >>
> >>
> >>
> >>> On Feb 15, 2022, at 3:58 AM, Jakub Jelinek <jakub@redhat.com> wrote:
> >>>
> >>> Hi!
> >>>
> >>> For IBM double double I've added in PR95450 and PR99648 verification that
> >>> when we at the tree/GIMPLE or RTL level interpret target bytes as a REAL_CST
> >>> or CONST_DOUBLE constant, we try to encode it back to target bytes and
> >>> verify it is the same.
> >>> This is because our real.c support isn't able to represent all valid values
> >>> of IBM double double which has variable precision.
> >>> In PR104522, it has been noted that we have similar problem with the
> >>> Intel/Motorola extended XFmode formats, our internal representation isn't
> >>> able to record pseudo denormals, pseudo infinities, pseudo NaNs and unnormal
> >>> values.
> >>> So, the following patch is an attempt to extend that verification to all
> >>> floats.
> >>> Unfortunately, it wasn't that straightforward, because the
> >>> __builtin_clear_padding code exactly for the XFmode long doubles needs to
> >>> discover what bits are padding and does that by interpreting memory of
> >>> all 1s.  That is actually a valid supported value, a qNaN with negative
> >>> sign with all mantissa bits set, but the verification includes also the
> >>> padding bits (exactly what __builtin_clear_padding wants to figure out)
> >>> and so fails the comparison check and so we ICE.
> >>> The patch fixes that case by moving that verification from
> >>> native_interpret_real to its caller, so that clear_padding_type can
> >>> call native_interpret_real and avoid that extra check.
> >>>
> >>> With this, the only thing that regresses in the testsuite is
> >>> +FAIL: gcc.target/i386/auto-init-4.c scan-assembler-times long\\t-16843010 5
> >>> because it decides to use a pattern that has non-zero bits in the padding
> >>> bits of the long double, so the simplify-rtx.cc change prevents folding
> >>> a SUBREG into a constant.  We emit (the testcase is -O0 but we emit worse
> >>> code at all opt levels) something like:
> >>>       movabsq $-72340172838076674, %rax
> >>>       movabsq $-72340172838076674, %rdx
> >>>       movq    %rax, -48(%rbp)
> >>>       movq    %rdx, -40(%rbp)
> >>>       fldt    -48(%rbp)
> >>>       fstpt   -32(%rbp)
> >>> instead of
> >>>       fldt    .LC2(%rip)
> >>>       fstpt   -32(%rbp)
> >>> ...
> >>> .LC2:
> >>>       .long   -16843010
> >>>       .long   -16843010
> >>>       .long   65278
> >>>       .long   0
> >>> Note, neither of those sequences actually stores the padding bits, fstpt
> >>> simply doesn't touch them.
> >>> For vars with clear_padding_real_needs_padding_p types that are allocated
> >>> to memory at expansion time, I'd say much better would be to do the stores
> >>> using integral modes rather than XFmode, so do that:
> >>>       movabsq $-72340172838076674, %rax
> >>>      movq    %rax, -32(%rbp)
> >>>      movq    %rax, -24(%rbp)
> >>> directly.  That is the only way to ensure the padding bits are initialized
> >>> (or expand __builtin_clear_padding, but then you initialize separately the
> >>> value bits and padding bits).
> >>>
> >>> Bootstrapped/regtested on x86_64-linux and i686-linux, though as mentioned
> >>> above, the gcc.target/i386/auto-init-4.c case is unresolved.
> >>
> >> Thanks, I will try to fix this testing case in a later patch.
> >
> > I've looked at this FAIL now and really wonder whether "pattern init" as
> > implemented makes any sense for non-integral types.
> > We end up with
> > initializing a register (SSA name) with
> >
> >  VIEW_CONVERT_EXPR<long double>(0xfefefefefefefefefefefefefefefefe)
> >
> > as we go building a TImode constant (we verified we have a TImode SET!)
> > but then
> >
> >          /* Pun the LHS to make sure its type has constant size
> >             unless it is an SSA name where that's already known.  */
> >          if (TREE_CODE (lhs) != SSA_NAME)
> >            lhs = build1 (VIEW_CONVERT_EXPR, itype, lhs);
> >          else
> >            init = fold_build1 (VIEW_CONVERT_EXPR, TREE_TYPE (lhs), init);
> > ...
> >      expand_assignment (lhs, init, false);
> >
> > and generally registers do not have any padding.  This weird expansion
> > then causes us to spill the TImode constant and reload the XFmode value,
> > which is definitely not optimal here.
> >
> > One approach to avoid the worse code generation would be to use mode
> > specific patterns for registers (like using a NaN or a target specific
> > value that
> > can be loaded cheaply),
>
> You mean that using “mode specific patterns” ONLY for registers?
> Can we use “mode specific patterns” consistently for both registers and memory?

The difference is that registers don't have padding while memory possibly
does; also, for aggregates, using different patterns is too expensive IMHO.
For registers the extra complication with generic patterns is that
efficient initialization (without going through memory) should be a
priority IMHO.

And for stack memory I still think that initializing the full allocated
frame (including padding between variables!) is the best approach.

> LLVM use different patterns for different types (Integral, Float, pointer type, etc…) in order to
> Catch bugs easily for different types.
>
> The beginning versions of this patch tried to do similar as LLVM, but later we gave up this and
> Use the same pattern for all different types.
>
> If now, we want to use “mode specific patterns” for registers, I am wondering whether it’s
> Better to just use “mode specific patterns” consistently for both registers and memory?
>
> > another would be to simply fall back to zero
> > initialization
> > when we fail to constant fold the initializer like with
> >
> > diff --git a/gcc/internal-fn.cc b/gcc/internal-fn.cc
> > index 8b1733e20c4..a4b995f71e4 100644
> > --- a/gcc/internal-fn.cc
> > +++ b/gcc/internal-fn.cc
> > @@ -3125,7 +3125,11 @@ expand_DEFERRED_INIT (internal_fn, gcall *stmt)
> >          if (TREE_CODE (lhs) != SSA_NAME)
> >            lhs = build1 (VIEW_CONVERT_EXPR, itype, lhs);
> >          else
> > -           init = fold_build1 (VIEW_CONVERT_EXPR, TREE_TYPE (lhs), init);
> > +           {
> > +             init = fold_build1 (VIEW_CONVERT_EXPR, TREE_TYPE (lhs), init);
> > +             if (!CONSTANT_CLASS_P (init))
> > +               init = build_zero_cst (TREE_TYPE (lhs));
> > +           }
> >        }
> >       else
> >        /* Use zero-init also for variable-length sizes.  */
> >
> > note that still does not address the issue noted by Jakub that we do not
> > initialize padding this way - but of course that's because we expand a
> > special assignment from .DEFERRED_INIT and are not initializing
> > the storage allocated for the variable on the stack (at -O0) by RTL
> > expansion.  Ideally .DEFERRED_INIT "expansion" for stack variables
> > would take place there, simply filling the allocated frame with the
> > pattern or zero.  That would probably mean that RTL expansion
> > should scoop up .DEFERRED_INITs for variables it decides not to
> > expand to pseudos and not leave that to stmt expansion.
>
> Need to study this a little bit more...
>
> >
> > I'm going to simply XFAIL this testcase for GCC 12 and opened
> > PR105259 so we can revisit the above for GCC 13.
>
> I can further study this bug and try to come up with a good solution in gcc13.
>
> Thank you!
>
> Qing
> >
> > Richard.
> >
> >> Qing
> >>>
> >>
>
Qing Zhao April 19, 2022, 9:36 p.m. UTC | #6
> On Apr 14, 2022, at 1:53 AM, Richard Biener <richard.guenther@gmail.com> wrote:
> 
> On Wed, Apr 13, 2022 at 5:22 PM Qing Zhao <qing.zhao@oracle.com> wrote:
>> 
>> Hi, Richard,
>> 
>> Thanks a lot for taking a look at this issue (and Sorry that I haven’t fixed this one yet, I was distracted by other tasks then just forgot this one….)
>> 
>>> On Apr 13, 2022, at 3:41 AM, Richard Biener <richard.guenther@gmail.com> wrote:
>>> 
>>> On Tue, Feb 15, 2022 at 5:31 PM Qing Zhao via Gcc-patches
>>> <gcc-patches@gcc.gnu.org> wrote:
>>>> 
>>>> 
>>>> 
>>>>> On Feb 15, 2022, at 3:58 AM, Jakub Jelinek <jakub@redhat.com> wrote:
>>>>> 
>>>>> Hi!
>>>>> 
>>>>> For IBM double double I've added in PR95450 and PR99648 verification that
>>>>> when we at the tree/GIMPLE or RTL level interpret target bytes as a REAL_CST
>>>>> or CONST_DOUBLE constant, we try to encode it back to target bytes and
>>>>> verify it is the same.
>>>>> This is because our real.c support isn't able to represent all valid values
>>>>> of IBM double double which has variable precision.
>>>>> In PR104522, it has been noted that we have similar problem with the
>>>>> Intel/Motorola extended XFmode formats, our internal representation isn't
>>>>> able to record pseudo denormals, pseudo infinities, pseudo NaNs and unnormal
>>>>> values.
>>>>> So, the following patch is an attempt to extend that verification to all
>>>>> floats.
>>>>> Unfortunately, it wasn't that straightforward, because the
>>>>> __builtin_clear_padding code exactly for the XFmode long doubles needs to
>>>>> discover what bits are padding and does that by interpreting memory of
>>>>> all 1s.  That is actually a valid supported value, a qNaN with negative
>>>>> sign with all mantissa bits set, but the verification includes also the
>>>>> padding bits (exactly what __builtin_clear_padding wants to figure out)
>>>>> and so fails the comparison check and so we ICE.
>>>>> The patch fixes that case by moving that verification from
>>>>> native_interpret_real to its caller, so that clear_padding_type can
>>>>> call native_interpret_real and avoid that extra check.
>>>>> 
>>>>> With this, the only thing that regresses in the testsuite is
>>>>> +FAIL: gcc.target/i386/auto-init-4.c scan-assembler-times long\\t-16843010 5
>>>>> because it decides to use a pattern that has non-zero bits in the padding
>>>>> bits of the long double, so the simplify-rtx.cc change prevents folding
>>>>> a SUBREG into a constant.  We emit (the testcase is -O0 but we emit worse
>>>>> code at all opt levels) something like:
>>>>>      movabsq $-72340172838076674, %rax
>>>>>      movabsq $-72340172838076674, %rdx
>>>>>      movq    %rax, -48(%rbp)
>>>>>      movq    %rdx, -40(%rbp)
>>>>>      fldt    -48(%rbp)
>>>>>      fstpt   -32(%rbp)
>>>>> instead of
>>>>>      fldt    .LC2(%rip)
>>>>>      fstpt   -32(%rbp)
>>>>> ...
>>>>> .LC2:
>>>>>      .long   -16843010
>>>>>      .long   -16843010
>>>>>      .long   65278
>>>>>      .long   0
>>>>> Note, neither of those sequences actually stores the padding bits, fstpt
>>>>> simply doesn't touch them.
>>>>> For vars with clear_padding_real_needs_padding_p types that are allocated
>>>>> to memory at expansion time, I'd say much better would be to do the stores
>>>>> using integral modes rather than XFmode, so do that:
>>>>>      movabsq $-72340172838076674, %rax
>>>>>     movq    %rax, -32(%rbp)
>>>>>     movq    %rax, -24(%rbp)
>>>>> directly.  That is the only way to ensure the padding bits are initialized
>>>>> (or expand __builtin_clear_padding, but then you initialize separately the
>>>>> value bits and padding bits).
>>>>> 
>>>>> Bootstrapped/regtested on x86_64-linux and i686-linux, though as mentioned
>>>>> above, the gcc.target/i386/auto-init-4.c case is unresolved.
>>>> 
>>>> Thanks, I will try to fix this testing case in a later patch.
>>> 
>>> I've looked at this FAIL now and really wonder whether "pattern init" as
>>> implemented makes any sense for non-integral types.
>>> We end up with
>>> initializing a register (SSA name) with
>>> 
>>> VIEW_CONVERT_EXPR<long double>(0xfefefefefefefefefefefefefefefefe)
>>> 
>>> as we go building a TImode constant (we verified we have a TImode SET!)
>>> but then
>>> 
>>>         /* Pun the LHS to make sure its type has constant size
>>>            unless it is an SSA name where that's already known.  */
>>>         if (TREE_CODE (lhs) != SSA_NAME)
>>>           lhs = build1 (VIEW_CONVERT_EXPR, itype, lhs);
>>>         else
>>>           init = fold_build1 (VIEW_CONVERT_EXPR, TREE_TYPE (lhs), init);
>>> ...
>>>     expand_assignment (lhs, init, false);
>>> 
>>> and generally registers do not have any padding.  This weird expansion
>>> then causes us to spill the TImode constant and reload the XFmode value,
>>> which is definitely not optimal here.
>>> 
>>> One approach to avoid the worse code generation would be to use mode
>>> specific patterns for registers (like using a NaN or a target specific
>>> value that
>>> can be loaded cheaply),
>> 
>> You mean that using “mode specific patterns” ONLY for registers?
>> Can we use “mode specific patterns” consistently for both registers and memory?
> 
> The difference is that registers don't have padding while memory
> possibly does, also
> for aggregates using different patterns is too expensive IMHO.  For
> registers the extra
> complication with generic patterns is that efficient initialization
> (without going through memory)
> should be a priority IMHO.
> 
> And for stack memory I still think that initializing the full
> allocated frame (including padding
> between variables!) is the best approach.
> 
>> LLVM use different patterns for different types (Integral, Float, pointer type, etc…) in order to
>> Catch bugs easily for different types.
>> 
>> The beginning versions of this patch tried to do similar as LLVM, but later we gave up this and
>> Use the same pattern for all different types.
>> 
>> If now, we want to use “mode specific patterns” for registers, I am wondering whether it’s
>> Better to just use “mode specific patterns” consistently for both registers and memory?
>> 
>>> another would be to simply fall back to zero
>>> initialization
>>> when we fail to constant fold the initializer like with
>>> 
>>> diff --git a/gcc/internal-fn.cc b/gcc/internal-fn.cc
>>> index 8b1733e20c4..a4b995f71e4 100644
>>> --- a/gcc/internal-fn.cc
>>> +++ b/gcc/internal-fn.cc
>>> @@ -3125,7 +3125,11 @@ expand_DEFERRED_INIT (internal_fn, gcall *stmt)
>>>         if (TREE_CODE (lhs) != SSA_NAME)
>>>           lhs = build1 (VIEW_CONVERT_EXPR, itype, lhs);
>>>         else
>>> -           init = fold_build1 (VIEW_CONVERT_EXPR, TREE_TYPE (lhs), init);
>>> +           {
>>> +             init = fold_build1 (VIEW_CONVERT_EXPR, TREE_TYPE (lhs), init);
>>> +             if (!CONSTANT_CLASS_P (init))
>>> +               init = build_zero_cst (TREE_TYPE (lhs));
>>> +           }
>>>       }
>>>      else
>>>       /* Use zero-init also for variable-length sizes.  */
>>> 
>>> note that still does not address the issue noted by Jakub that we do not
>>> initialize padding this way - but of course that's because we expand a
>>> special assignment from .DEFERRED_INIT and are not initializing
>>> the storage allocated for the variable on the stack (at -O0) by RTL
>>> expansion.  Ideally .DEFERRED_INIT "expansion" for stack variables
>>> would take place there, simply filling the allocated frame with the
>>> pattern or zero.  That would probably mean that RTL expansion
>>> should scoop up .DEFERRED_INITs for variables it decides not to
>>> expand to pseudos and not leave that to stmt expansion.
>> 
>> Need to study this a little bit more...


So, is what you mean above a complete rewrite of the current “expand_DEFERRED_INIT”?

Instead of simply using “expand_builtin_memset” for “variables that are not in a register” and “expand_assignment”
for “variables that are in a register”, RTL expansion should directly expand this call at a lower level:

i.e.,

  tree lhs = gimple_call_lhs (stmt);
  …

  rtx target = expand_expr (lhs, NULL_RTX, VOIDmode, EXPAND_WRITE);

  if (MEM_P (target))   // the variable is allocated on the stack
  {
     // fill the allocated frame with the pattern or zero.  How to do it??
  }
  else                  // the variable is in a pseudo register.
  {
    rtx init_rtx = …;
    emit_move_insn (target, init_rtx);
  }

Is the above a correct understanding?

Thanks.

Qing

>> 
>>> 
>>> I'm going to simply XFAIL this testcase for GCC 12 and opened
>>> PR105259 so we can revisit the above for GCC 13.
>> 
>> I can further study this bug and try to come up with a good solution in gcc13.
>> 
>> Thank you!
>> 
>> Qing
>>> 
>>> Richard.
>>> 
>>>> Qing
>>>>> 
>>>> 
>>
Richard Biener April 20, 2022, 10:38 a.m. UTC | #7
On Tue, Apr 19, 2022 at 11:36 PM Qing Zhao <qing.zhao@oracle.com> wrote:
>
>
>
> > On Apr 14, 2022, at 1:53 AM, Richard Biener <richard.guenther@gmail.com> wrote:
> >
> > On Wed, Apr 13, 2022 at 5:22 PM Qing Zhao <qing.zhao@oracle.com> wrote:
> >>
> >> Hi, Richard,
> >>
> >> Thanks a lot for taking a look at this issue (and Sorry that I haven’t fixed this one yet, I was distracted by other tasks then just forgot this one….)
> >>
> >>> On Apr 13, 2022, at 3:41 AM, Richard Biener <richard.guenther@gmail.com> wrote:
> >>>
> >>> On Tue, Feb 15, 2022 at 5:31 PM Qing Zhao via Gcc-patches
> >>> <gcc-patches@gcc.gnu.org> wrote:
> >>>>
> >>>>
> >>>>
> >>>>> On Feb 15, 2022, at 3:58 AM, Jakub Jelinek <jakub@redhat.com> wrote:
> >>>>>
> >>>>> Hi!
> >>>>>
> >>>>> For IBM double double I've added in PR95450 and PR99648 verification that
> >>>>> when we at the tree/GIMPLE or RTL level interpret target bytes as a REAL_CST
> >>>>> or CONST_DOUBLE constant, we try to encode it back to target bytes and
> >>>>> verify it is the same.
> >>>>> This is because our real.c support isn't able to represent all valid values
> >>>>> of IBM double double which has variable precision.
> >>>>> In PR104522, it has been noted that we have similar problem with the
> >>>>> Intel/Motorola extended XFmode formats, our internal representation isn't
> >>>>> able to record pseudo denormals, pseudo infinities, pseudo NaNs and unnormal
> >>>>> values.
> >>>>> So, the following patch is an attempt to extend that verification to all
> >>>>> floats.
> >>>>> Unfortunately, it wasn't that straightforward, because the
> >>>>> __builtin_clear_padding code exactly for the XFmode long doubles needs to
> >>>>> discover what bits are padding and does that by interpreting memory of
> >>>>> all 1s.  That is actually a valid supported value, a qNaN with negative
> >>>>> sign with all mantissa bits set, but the verification includes also the
> >>>>> padding bits (exactly what __builtin_clear_padding wants to figure out)
> >>>>> and so fails the comparison check and so we ICE.
> >>>>> The patch fixes that case by moving that verification from
> >>>>> native_interpret_real to its caller, so that clear_padding_type can
> >>>>> call native_interpret_real and avoid that extra check.
> >>>>>
> >>>>> With this, the only thing that regresses in the testsuite is
> >>>>> +FAIL: gcc.target/i386/auto-init-4.c scan-assembler-times long\\t-16843010 5
> >>>>> because it decides to use a pattern that has non-zero bits in the padding
> >>>>> bits of the long double, so the simplify-rtx.cc change prevents folding
> >>>>> a SUBREG into a constant.  We emit (the testcase is -O0 but we emit worse
> >>>>> code at all opt levels) something like:
> >>>>>      movabsq $-72340172838076674, %rax
> >>>>>      movabsq $-72340172838076674, %rdx
> >>>>>      movq    %rax, -48(%rbp)
> >>>>>      movq    %rdx, -40(%rbp)
> >>>>>      fldt    -48(%rbp)
> >>>>>      fstpt   -32(%rbp)
> >>>>> instead of
> >>>>>      fldt    .LC2(%rip)
> >>>>>      fstpt   -32(%rbp)
> >>>>> ...
> >>>>> .LC2:
> >>>>>      .long   -16843010
> >>>>>      .long   -16843010
> >>>>>      .long   65278
> >>>>>      .long   0
> >>>>> Note, neither of those sequences actually stores the padding bits, fstpt
> >>>>> simply doesn't touch them.
> >>>>> For vars with clear_padding_real_needs_padding_p types that are allocated
> >>>>> to memory at expansion time, I'd say much better would be to do the stores
> >>>>> using integral modes rather than XFmode, so do that:
> >>>>>      movabsq $-72340172838076674, %rax
> >>>>>     movq    %rax, -32(%rbp)
> >>>>>     movq    %rax, -24(%rbp)
> >>>>> directly.  That is the only way to ensure the padding bits are initialized
> >>>>> (or expand __builtin_clear_padding, but then you initialize separately the
> >>>>> value bits and padding bits).
> >>>>>
> >>>>> Bootstrapped/regtested on x86_64-linux and i686-linux, though as mentioned
> >>>>> above, the gcc.target/i386/auto-init-4.c case is unresolved.
> >>>>
> >>>> Thanks, I will try to fix this testing case in a later patch.
> >>>
> >>> I've looked at this FAIL now and really wonder whether "pattern init" as
> >>> implemented makes any sense for non-integral types.
> >>> We end up with
> >>> initializing a register (SSA name) with
> >>>
> >>> VIEW_CONVERT_EXPR<long double>(0xfefefefefefefefefefefefefefefefe)
> >>>
> >>> as we go building a TImode constant (we verified we have a TImode SET!)
> >>> but then
> >>>
> >>>         /* Pun the LHS to make sure its type has constant size
> >>>            unless it is an SSA name where that's already known.  */
> >>>         if (TREE_CODE (lhs) != SSA_NAME)
> >>>           lhs = build1 (VIEW_CONVERT_EXPR, itype, lhs);
> >>>         else
> >>>           init = fold_build1 (VIEW_CONVERT_EXPR, TREE_TYPE (lhs), init);
> >>> ...
> >>>     expand_assignment (lhs, init, false);
> >>>
> >>> and generally registers do not have any padding.  This weird expansion
> >>> then causes us to spill the TImode constant and reload the XFmode value,
> >>> which is definitely not optimal here.
> >>>
> >>> One approach to avoid the worse code generation would be to use mode
> >>> specific patterns for registers (like using a NaN or a target specific
> >>> value that
> >>> can be loaded cheaply),
> >>
> >> You mean that using “mode specific patterns” ONLY for registers?
> >> Can we use “mode specific patterns” consistently for both registers and memory?
> >
> > The difference is that registers don't have padding while memory
> > possibly does, also
> > for aggregates using different patterns is too expensive IMHO.  For
> > registers the extra
> > complication with generic patterns is that efficient initialization
> > (without going through memory)
> > should be a priority IMHO.
> >
> > And for stack memory I still think that initializing the full
> > allocated frame (including padding
> > between variables!) is the best approach.
> >
> >> LLVM use different patterns for different types (Integral, Float, pointer type, etc…) in order to
> >> Catch bugs easily for different types.
> >>
> >> The beginning versions of this patch tried to do similar as LLVM, but later we gave up this and
> >> Use the same pattern for all different types.
> >>
> >> If now, we want to use “mode specific patterns” for registers, I am wondering whether it’s
> >> Better to just use “mode specific patterns” consistently for both registers and memory?
> >>
> >>> another would be to simply fall back to zero
> >>> initialization
> >>> when we fail to constant fold the initializer like with
> >>>
> >>> diff --git a/gcc/internal-fn.cc b/gcc/internal-fn.cc
> >>> index 8b1733e20c4..a4b995f71e4 100644
> >>> --- a/gcc/internal-fn.cc
> >>> +++ b/gcc/internal-fn.cc
> >>> @@ -3125,7 +3125,11 @@ expand_DEFERRED_INIT (internal_fn, gcall *stmt)
> >>>         if (TREE_CODE (lhs) != SSA_NAME)
> >>>           lhs = build1 (VIEW_CONVERT_EXPR, itype, lhs);
> >>>         else
> >>> -           init = fold_build1 (VIEW_CONVERT_EXPR, TREE_TYPE (lhs), init);
> >>> +           {
> >>> +             init = fold_build1 (VIEW_CONVERT_EXPR, TREE_TYPE (lhs), init);
> >>> +             if (!CONSTANT_CLASS_P (init))
> >>> +               init = build_zero_cst (TREE_TYPE (lhs));
> >>> +           }
> >>>       }
> >>>      else
> >>>       /* Use zero-init also for variable-length sizes.  */
> >>>
> >>> note that still does not address the issue noted by Jakub that we do not
> >>> initialize padding this way - but of course that's because we expand a
> >>> special assignment from .DEFERRED_INIT and are not initializing
> >>> the storage allocated for the variable on the stack (at -O0) by RTL
> >>> expansion.  Ideally .DEFERRED_INIT "expansion" for stack variables
> >>> would take place there, simply filling the allocated frame with the
> >>> pattern or zero.  That would probably mean that RTL expansion
> >>> should scoop up .DEFERRED_INITs for variables it decides not to
> >>> expand to pseudos and not leave that to stmt expansion.
> >>
> >> Need to study this a little bit more...
>
>
> So, Is what you mean in the above a complete rewrite of the current “expand_DEFERRED_INIT”:
>
> Instead of simply using “expand_builtin_memset” for “variables that are not in register” and  “expand_assignment”
> for “variables that are in register”,  RTL expansion should directly expand this call in a lower level:
>
> i.e,
>
>   tree lhs = gimple_call_lhs (stmt);
>   …
>
>   rtx target = expand_expr (lhs, NULL_RTX, VOIDmode, EXPAND_WRITE);
>
>   If (MEM_P (target))   // the variable is allocated on stack
>   {
>      // filling the allocated frame with pattern or zero.   How to do it??
>   }
>   else                     // the variable is in pseudo register.
>   {
>     rtx init_rtx = …;
>     emit_move_insn (target, init_rtx)
>   }
>
>
> Is the above a correct understanding?

No, I would simply expand the .DEFERRED_INIT calls for automatic in-memory
variables to nothing and instead process their effect at the time we do
expand_one_stack_var; that means scanning for the .DEFERRED_INITs at CFG
expansion time and recording whether and how to initialize the variables.
We can then group the variables accordingly and block-initialize the stack
space, also covering padding in between variables.
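
(To make the idea concrete, here is a toy model of the block-initialization
step; all names are hypothetical and this makes no claim about how it would
actually be wired into cfgexpand:)

#include <string.h>

/* Hypothetical description of a contiguous frame range that holds only
   pattern-initialized variables, padding between them included.  */
struct deferred_init_range { size_t offset, size; };

void
block_init_frame (unsigned char *frame,
		  const struct deferred_init_range *ranges, size_t n)
{
  for (size_t i = 0; i < n; i++)
    /* One block fill per range instead of one store per variable.  */
    memset (frame + ranges[i].offset, 0xfe, ranges[i].size);
}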

Richard.

> Thanks.
>
> Qing
>
> >>
> >>>
> >>> I'm going to simply XFAIL this testcase for GCC 12 and opened
> >>> PR105259 so we can revisit the above for GCC 13.
> >>
> >> I can further study this bug and try to come up with a good solution in gcc13.
> >>
> >> Thank you!
> >>
> >> Qing
> >>>
> >>> Richard.
> >>>
> >>>> Qing
> >>>>>
> >>>>
> >>
>
Qing Zhao April 20, 2022, 4:02 p.m. UTC | #8
> On Apr 20, 2022, at 5:38 AM, Richard Biener <richard.guenther@gmail.com> wrote:
> 
> On Tue, Apr 19, 2022 at 11:36 PM Qing Zhao <qing.zhao@oracle.com> wrote:
>> 
>> 
>> 
>>> On Apr 14, 2022, at 1:53 AM, Richard Biener <richard.guenther@gmail.com> wrote:
>>> 
>>> On Wed, Apr 13, 2022 at 5:22 PM Qing Zhao <qing.zhao@oracle.com> wrote:
>>>> 
>>>> Hi, Richard,
>>>> 
>>>> Thanks a lot for taking a look at this issue (and Sorry that I haven’t fixed this one yet, I was distracted by other tasks then just forgot this one….)
>>>> 
>>>>> On Apr 13, 2022, at 3:41 AM, Richard Biener <richard.guenther@gmail.com> wrote:
>>>>> 
>>>>> On Tue, Feb 15, 2022 at 5:31 PM Qing Zhao via Gcc-patches
>>>>> <gcc-patches@gcc.gnu.org> wrote:
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>>> On Feb 15, 2022, at 3:58 AM, Jakub Jelinek <jakub@redhat.com> wrote:
>>>>>>> 
>>>>>>> Hi!
>>>>>>> 
>>>>>>> For IBM double double I've added in PR95450 and PR99648 verification that
>>>>>>> when we at the tree/GIMPLE or RTL level interpret target bytes as a REAL_CST
>>>>>>> or CONST_DOUBLE constant, we try to encode it back to target bytes and
>>>>>>> verify it is the same.
>>>>>>> This is because our real.c support isn't able to represent all valid values
>>>>>>> of IBM double double which has variable precision.
>>>>>>> In PR104522, it has been noted that we have similar problem with the
>>>>>>> Intel/Motorola extended XFmode formats, our internal representation isn't
>>>>>>> able to record pseudo denormals, pseudo infinities, pseudo NaNs and unnormal
>>>>>>> values.
>>>>>>> So, the following patch is an attempt to extend that verification to all
>>>>>>> floats.
>>>>>>> Unfortunately, it wasn't that straightforward, because the
>>>>>>> __builtin_clear_padding code exactly for the XFmode long doubles needs to
>>>>>>> discover what bits are padding and does that by interpreting memory of
>>>>>>> all 1s.  That is actually a valid supported value, a qNaN with negative
>>>>>>> sign with all mantissa bits set, but the verification includes also the
>>>>>>> padding bits (exactly what __builtin_clear_padding wants to figure out)
>>>>>>> and so fails the comparison check and so we ICE.
>>>>>>> The patch fixes that case by moving that verification from
>>>>>>> native_interpret_real to its caller, so that clear_padding_type can
>>>>>>> call native_interpret_real and avoid that extra check.
>>>>>>> 
>>>>>>> With this, the only thing that regresses in the testsuite is
>>>>>>> +FAIL: gcc.target/i386/auto-init-4.c scan-assembler-times long\\t-16843010 5
>>>>>>> because it decides to use a pattern that has non-zero bits in the padding
>>>>>>> bits of the long double, so the simplify-rtx.cc change prevents folding
>>>>>>> a SUBREG into a constant.  We emit (the testcase is -O0 but we emit worse
>>>>>>> code at all opt levels) something like:
>>>>>>>     movabsq $-72340172838076674, %rax
>>>>>>>     movabsq $-72340172838076674, %rdx
>>>>>>>     movq    %rax, -48(%rbp)
>>>>>>>     movq    %rdx, -40(%rbp)
>>>>>>>     fldt    -48(%rbp)
>>>>>>>     fstpt   -32(%rbp)
>>>>>>> instead of
>>>>>>>     fldt    .LC2(%rip)
>>>>>>>     fstpt   -32(%rbp)
>>>>>>> ...
>>>>>>> .LC2:
>>>>>>>     .long   -16843010
>>>>>>>     .long   -16843010
>>>>>>>     .long   65278
>>>>>>>     .long   0
>>>>>>> Note, neither of those sequences actually stores the padding bits, fstpt
>>>>>>> simply doesn't touch them.
>>>>>>> For vars with clear_padding_real_needs_padding_p types that are allocated
>>>>>>> to memory at expansion time, I'd say much better would be to do the stores
>>>>>>> using integral modes rather than XFmode, so do that:
>>>>>>>     movabsq $-72340172838076674, %rax
>>>>>>>    movq    %rax, -32(%rbp)
>>>>>>>    movq    %rax, -24(%rbp)
>>>>>>> directly.  That is the only way to ensure the padding bits are initialized
>>>>>>> (or expand __builtin_clear_padding, but then you initialize separately the
>>>>>>> value bits and padding bits).
>>>>>>> 
>>>>>>> Bootstrapped/regtested on x86_64-linux and i686-linux, though as mentioned
>>>>>>> above, the gcc.target/i386/auto-init-4.c case is unresolved.
>>>>>> 
>>>>>> Thanks, I will try to fix this testing case in a later patch.
>>>>> 
>>>>> I've looked at this FAIL now and really wonder whether "pattern init" as
>>>>> implemented makes any sense for non-integral types.
>>>>> We end up with
>>>>> initializing a register (SSA name) with
>>>>> 
>>>>> VIEW_CONVERT_EXPR<long double>(0xfefefefefefefefefefefefefefefefe)
>>>>> 
>>>>> as we go building a TImode constant (we verified we have a TImode SET!)
>>>>> but then
>>>>> 
>>>>>        /* Pun the LHS to make sure its type has constant size
>>>>>           unless it is an SSA name where that's already known.  */
>>>>>        if (TREE_CODE (lhs) != SSA_NAME)
>>>>>          lhs = build1 (VIEW_CONVERT_EXPR, itype, lhs);
>>>>>        else
>>>>>          init = fold_build1 (VIEW_CONVERT_EXPR, TREE_TYPE (lhs), init);
>>>>> ...
>>>>>    expand_assignment (lhs, init, false);
>>>>> 
>>>>> and generally registers do not have any padding.  This weird expansion
>>>>> then causes us to spill the TImode constant and reload the XFmode value,
>>>>> which is definitely not optimal here.
>>>>> 
>>>>> One approach to avoid the worse code generation would be to use mode
>>>>> specific patterns for registers (like using a NaN or a target specific
>>>>> value that
>>>>> can be loaded cheaply),
>>>> 
>>>> You mean that using “mode specific patterns” ONLY for registers?
>>>> Can we use “mode specific patterns” consistently for both registers and memory?
>>> 
>>> The difference is that registers don't have padding while memory
>>> possibly does, also
>>> for aggregates using different patterns is too expensive IMHO.  For
>>> registers the extra
>>> complication with generic patterns is that efficient initialization
>>> (without going through memory)
>>> should be a priority IMHO.
>>> 
>>> And for stack memory I still think that initializing the full
>>> allocated frame (including padding
>>> between variables!) is the best approach.
>>> 
>>>> LLVM use different patterns for different types (Integral, Float, pointer type, etc…) in order to
>>>> Catch bugs easily for different types.
>>>> 
>>>> The beginning versions of this patch tried to do similar as LLVM, but later we gave up this and
>>>> Use the same pattern for all different types.
>>>> 
>>>> If now, we want to use “mode specific patterns” for registers, I am wondering whether it’s
>>>> Better to just use “mode specific patterns” consistently for both registers and memory?
>>>> 
>>>>> another would be to simply fall back to zero
>>>>> initialization
>>>>> when we fail to constant fold the initializer like with
>>>>> 
>>>>> diff --git a/gcc/internal-fn.cc b/gcc/internal-fn.cc
>>>>> index 8b1733e20c4..a4b995f71e4 100644
>>>>> --- a/gcc/internal-fn.cc
>>>>> +++ b/gcc/internal-fn.cc
>>>>> @@ -3125,7 +3125,11 @@ expand_DEFERRED_INIT (internal_fn, gcall *stmt)
>>>>>        if (TREE_CODE (lhs) != SSA_NAME)
>>>>>          lhs = build1 (VIEW_CONVERT_EXPR, itype, lhs);
>>>>>        else
>>>>> -           init = fold_build1 (VIEW_CONVERT_EXPR, TREE_TYPE (lhs), init);
>>>>> +           {
>>>>> +             init = fold_build1 (VIEW_CONVERT_EXPR, TREE_TYPE (lhs), init);
>>>>> +             if (!CONSTANT_CLASS_P (init))
>>>>> +               init = build_zero_cst (TREE_TYPE (lhs));
>>>>> +           }
>>>>>      }
>>>>>     else
>>>>>      /* Use zero-init also for variable-length sizes.  */
>>>>> 
>>>>> note that still does not address the issue noted by Jakub that we do not
>>>>> initialize padding this way - but of course that's because we expand a
>>>>> special assignment from .DEFERRED_INIT and are not initializing
>>>>> the storage allocated for the variable on the stack (at -O0) by RTL
>>>>> expansion.  Ideally .DEFERRED_INIT "expansion" for stack variables
>>>>> would take place there, simply filling the allocated frame with the
>>>>> pattern or zero.  That would probably mean that RTL expansion
>>>>> should scoop up .DEFERRED_INITs for variables it decides not to
>>>>> expand to pseudos and not leave that to stmt expansion.
>>>> 
>>>> Need to study this a little bit more...
>> 
>> 
>> So, Is what you mean in the above a complete rewrite of the current “expand_DEFERRED_INIT”:
>> 
>> Instead of simply using “expand_builtin_memset” for “variables that are not in register” and  “expand_assignment”
>> for “variables that are in register”,  RTL expansion should directly expand this call in a lower level:
>> 
>> i.e,
>> 
>>  tree lhs = gimple_call_lhs (stmt);
>>  …
>> 
>>  rtx target = expand_expr (lhs, NULL_RTX, VOIDmode, EXPAND_WRITE);
>> 
>>  If (MEM_P (target))   // the variable is allocated on stack
>>  {
>>     // filling the allocated frame with pattern or zero.   How to do it??
>>  }
>>  else                     // the variable is in pseudo register.
>>  {
>>    rtx init_rtx = …;
>>    emit_move_insn (target, init_rtx)
>>  }
>> 
>> 
>> Is the above a correct understanding?
> 
> No, I would simply expand the automatic in-memory var .DEFERRED_INIT
> calls to nothing
> but instead process their effect at the time we do
> expand_one_stack_var, which means at
> CFG expansion time scan for the .DEFERRED_INITs and record whether and how to
> initialize the variables.  We can then group the variables accordingly
> and block-initialize
> the stack space also covering padding inbetween variables.

So, the above only covers the variables that will be allocated on the stack?
For the variables that are in pseudo registers, do we still need to simply expand the initialization to a move insn?

And for the variables in pseudo registers, some of them will later be spilled onto the stack after register allocation, too.
Especially for long double/complex double variables that might have padding, shall we handle them specially at that point?

Qing
> 
> Richard.
> 
>> Thanks.
>> 
>> Qing
>> 
>>>> 
>>>>> 
>>>>> I'm going to simply XFAIL this testcase for GCC 12 and opened
>>>>> PR105259 so we can revisit the above for GCC 13.
>>>> 
>>>> I can further study this bug and try to come up with a good solution in gcc13.
>>>> 
>>>> Thank you!
>>>> 
>>>> Qing
>>>>> 
>>>>> Richard.
>>>>> 
>>>>>> Qing
Richard Biener April 21, 2022, 7:09 a.m. UTC | #9
On Wed, Apr 20, 2022 at 6:02 PM Qing Zhao <qing.zhao@oracle.com> wrote:
>
>
>
> > On Apr 20, 2022, at 5:38 AM, Richard Biener <richard.guenther@gmail.com> wrote:
> >
> > On Tue, Apr 19, 2022 at 11:36 PM Qing Zhao <qing.zhao@oracle.com> wrote:
> >>
> >>
> >>
> >>> On Apr 14, 2022, at 1:53 AM, Richard Biener <richard.guenther@gmail.com> wrote:
> >>>
> >>> On Wed, Apr 13, 2022 at 5:22 PM Qing Zhao <qing.zhao@oracle.com> wrote:
> >>>>
> >>>> Hi, Richard,
> >>>>
> >>>> Thanks a lot for taking a look at this issue (and Sorry that I haven’t fixed this one yet, I was distracted by other tasks then just forgot this one….)
> >>>>
> >>>>> On Apr 13, 2022, at 3:41 AM, Richard Biener <richard.guenther@gmail.com> wrote:
> >>>>>
> >>>>> On Tue, Feb 15, 2022 at 5:31 PM Qing Zhao via Gcc-patches
> >>>>> <gcc-patches@gcc.gnu.org> wrote:
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>> On Feb 15, 2022, at 3:58 AM, Jakub Jelinek <jakub@redhat.com> wrote:
> >>>>>>>
> >>>>>>> Hi!
> >>>>>>>
> >>>>>>> For IBM double double I've added in PR95450 and PR99648 verification that
> >>>>>>> when we at the tree/GIMPLE or RTL level interpret target bytes as a REAL_CST
> >>>>>>> or CONST_DOUBLE constant, we try to encode it back to target bytes and
> >>>>>>> verify it is the same.
> >>>>>>> This is because our real.c support isn't able to represent all valid values
> >>>>>>> of IBM double double which has variable precision.
> >>>>>>> In PR104522, it has been noted that we have similar problem with the
> >>>>>>> Intel/Motorola extended XFmode formats, our internal representation isn't
> >>>>>>> able to record pseudo denormals, pseudo infinities, pseudo NaNs and unnormal
> >>>>>>> values.
> >>>>>>> So, the following patch is an attempt to extend that verification to all
> >>>>>>> floats.
> >>>>>>> Unfortunately, it wasn't that straightforward, because the
> >>>>>>> __builtin_clear_padding code exactly for the XFmode long doubles needs to
> >>>>>>> discover what bits are padding and does that by interpreting memory of
> >>>>>>> all 1s.  That is actually a valid supported value, a qNaN with negative
> >>>>>>> sign with all mantissa bits set, but the verification includes also the
> >>>>>>> padding bits (exactly what __builtin_clear_padding wants to figure out)
> >>>>>>> and so fails the comparison check and so we ICE.
> >>>>>>> The patch fixes that case by moving that verification from
> >>>>>>> native_interpret_real to its caller, so that clear_padding_type can
> >>>>>>> call native_interpret_real and avoid that extra check.
> >>>>>>>
> >>>>>>> With this, the only thing that regresses in the testsuite is
> >>>>>>> +FAIL: gcc.target/i386/auto-init-4.c scan-assembler-times long\\t-16843010 5
> >>>>>>> because it decides to use a pattern that has non-zero bits in the padding
> >>>>>>> bits of the long double, so the simplify-rtx.cc change prevents folding
> >>>>>>> a SUBREG into a constant.  We emit (the testcase is -O0 but we emit worse
> >>>>>>> code at all opt levels) something like:
> >>>>>>>     movabsq $-72340172838076674, %rax
> >>>>>>>     movabsq $-72340172838076674, %rdx
> >>>>>>>     movq    %rax, -48(%rbp)
> >>>>>>>     movq    %rdx, -40(%rbp)
> >>>>>>>     fldt    -48(%rbp)
> >>>>>>>     fstpt   -32(%rbp)
> >>>>>>> instead of
> >>>>>>>     fldt    .LC2(%rip)
> >>>>>>>     fstpt   -32(%rbp)
> >>>>>>> ...
> >>>>>>> .LC2:
> >>>>>>>     .long   -16843010
> >>>>>>>     .long   -16843010
> >>>>>>>     .long   65278
> >>>>>>>     .long   0
> >>>>>>> Note, neither of those sequences actually stores the padding bits, fstpt
> >>>>>>> simply doesn't touch them.
> >>>>>>> For vars with clear_padding_real_needs_padding_p types that are allocated
> >>>>>>> to memory at expansion time, I'd say much better would be to do the stores
> >>>>>>> using integral modes rather than XFmode, so do that:
> >>>>>>>     movabsq $-72340172838076674, %rax
> >>>>>>>    movq    %rax, -32(%rbp)
> >>>>>>>    movq    %rax, -24(%rbp)
> >>>>>>> directly.  That is the only way to ensure the padding bits are initialized
> >>>>>>> (or expand __builtin_clear_padding, but then you initialize separately the
> >>>>>>> value bits and padding bits).
> >>>>>>>
> >>>>>>> Bootstrapped/regtested on x86_64-linux and i686-linux, though as mentioned
> >>>>>>> above, the gcc.target/i386/auto-init-4.c case is unresolved.
> >>>>>>
> >>>>>> Thanks, I will try to fix this testing case in a later patch.
> >>>>>
> >>>>> I've looked at this FAIL now and really wonder whether "pattern init" as
> >>>>> implemented makes any sense for non-integral types.
> >>>>> We end up with
> >>>>> initializing a register (SSA name) with
> >>>>>
> >>>>> VIEW_CONVERT_EXPR<long double>(0xfefefefefefefefefefefefefefefefe)
> >>>>>
> >>>>> as we go building a TImode constant (we verified we have a TImode SET!)
> >>>>> but then
> >>>>>
> >>>>>        /* Pun the LHS to make sure its type has constant size
> >>>>>           unless it is an SSA name where that's already known.  */
> >>>>>        if (TREE_CODE (lhs) != SSA_NAME)
> >>>>>          lhs = build1 (VIEW_CONVERT_EXPR, itype, lhs);
> >>>>>        else
> >>>>>          init = fold_build1 (VIEW_CONVERT_EXPR, TREE_TYPE (lhs), init);
> >>>>> ...
> >>>>>    expand_assignment (lhs, init, false);
> >>>>>
> >>>>> and generally registers do not have any padding.  This weird expansion
> >>>>> then causes us to spill the TImode constant and reload the XFmode value,
> >>>>> which is definitely not optimal here.
> >>>>>
> >>>>> One approach to avoid the worse code generation would be to use mode
> >>>>> specific patterns for registers (like using a NaN or a target specific
> >>>>> value that
> >>>>> can be loaded cheaply),
> >>>>
> >>>> You mean that using “mode specific patterns” ONLY for registers?
> >>>> Can we use “mode specific patterns” consistently for both registers and memory?
> >>>
> >>> The difference is that registers don't have padding while memory
> >>> possibly does, also
> >>> for aggregates using different patterns is too expensive IMHO.  For
> >>> registers the extra
> >>> complication with generic patterns is that efficient initialization
> >>> (without going through memory)
> >>> should be a priority IMHO.
> >>>
> >>> And for stack memory I still think that initializing the full
> >>> allocated frame (including padding
> >>> between variables!) is the best approach.
> >>>
> >>>> LLVM use different patterns for different types (Integral, Float, pointer type, etc…) in order to
> >>>> Catch bugs easily for different types.
> >>>>
> >>>> The beginning versions of this patch tried to do similar as LLVM, but later we gave up this and
> >>>> Use the same pattern for all different types.
> >>>>
> >>>> If now, we want to use “mode specific patterns” for registers, I am wondering whether it’s
> >>>> Better to just use “mode specific patterns” consistently for both registers and memory?
> >>>>
> >>>>> another would be to simply fall back to zero
> >>>>> initialization
> >>>>> when we fail to constant fold the initializer like with
> >>>>>
> >>>>> diff --git a/gcc/internal-fn.cc b/gcc/internal-fn.cc
> >>>>> index 8b1733e20c4..a4b995f71e4 100644
> >>>>> --- a/gcc/internal-fn.cc
> >>>>> +++ b/gcc/internal-fn.cc
> >>>>> @@ -3125,7 +3125,11 @@ expand_DEFERRED_INIT (internal_fn, gcall *stmt)
> >>>>>        if (TREE_CODE (lhs) != SSA_NAME)
> >>>>>          lhs = build1 (VIEW_CONVERT_EXPR, itype, lhs);
> >>>>>        else
> >>>>> -           init = fold_build1 (VIEW_CONVERT_EXPR, TREE_TYPE (lhs), init);
> >>>>> +           {
> >>>>> +             init = fold_build1 (VIEW_CONVERT_EXPR, TREE_TYPE (lhs), init);
> >>>>> +             if (!CONSTANT_CLASS_P (init))
> >>>>> +               init = build_zero_cst (TREE_TYPE (lhs));
> >>>>> +           }
> >>>>>      }
> >>>>>     else
> >>>>>      /* Use zero-init also for variable-length sizes.  */
> >>>>>
> >>>>> note that still does not address the issue noted by Jakub that we do not
> >>>>> initialize padding this way - but of course that's because we expand a
> >>>>> special assignment from .DEFERRED_INIT and are not initializing
> >>>>> the storage allocated for the variable on the stack (at -O0) by RTL
> >>>>> expansion.  Ideally .DEFERRED_INIT "expansion" for stack variables
> >>>>> would take place there, simply filling the allocated frame with the
> >>>>> pattern or zero.  That would probably mean that RTL expansion
> >>>>> should scoop up .DEFERRED_INITs for variables it decides not to
> >>>>> expand to pseudos and not leave that to stmt expansion.
> >>>>
> >>>> Need to study this a little bit more...
> >>
> >>
> >> So, Is what you mean in the above a complete rewrite of the current “expand_DEFERRED_INIT”:
> >>
> >> Instead of simply using “expand_builtin_memset” for “variables that are not in register” and  “expand_assignment”
> >> for “variables that are in register”,  RTL expansion should directly expand this call in a lower level:
> >>
> >> i.e,
> >>
> >>  tree lhs = gimple_call_lhs (stmt);
> >>  …
> >>
> >>  rtx target = expand_expr (lhs, NULL_RTX, VOIDmode, EXPAND_WRITE);
> >>
> >>  If (MEM_P (target))   // the variable is allocated on stack
> >>  {
> >>     // filling the allocated frame with pattern or zero.   How to do it??
> >>  }
> >>  else                     // the variable is in pseudo register.
> >>  {
> >>    rtx init_rtx = …;
> >>    emit_move_insn (target, init_rtx)
> >>  }
> >>
> >>
> >> Is the above a correct understanding?
> >
> > No, I would simply expand the automatic in-memory var .DEFERRED_INIT
> > calls to nothing
> > but instead process their effect at the time we do
> > expand_one_stack_var, which means at
> > CFG expansion time scan for the .DEFERRED_INITs and record whether and how to
> > initialize the variables.  We can then group the variables accordingly
> > and block-initialize
> > the stack space also covering padding inbetween variables.
>
> So, the above only covers the variables that will be allocated in stack?

Yes.

> For the variables that are in pseudo registers, we still need to simply expand the initialization to a move insn?

Yes, also for stack VLAs.

> And for the variables in pseudo registers, later after register allocation, some of them will be spilled into stack, too.
> Specially for long double/complex double variables that might have padding, shall we specially handle them during that time?

Ideally we'd never spill uninitialized variables, but the auto-init
itself can definitely create the need to spill that very value.  Does
that mean auto-init of registers should happen after RA?  One would
need to see what IRA/LRA do with uninitialized pseudo uses.  Computing
may-uninit uses after RA should be possible, and hopefully all ISAs
allow initializing hard registers without requiring a scratch register
(heh - I'm almost sure there will be exceptions ...)

Richard.

> Qing
> >
> > Richard.
> >
> >> Thanks.
> >>
> >> Qing
> >>
> >>>>
> >>>>>
> >>>>> I'm going to simply XFAIL this testcase for GCC 12 and opened
> >>>>> PR105259 so we can revisit the above for GCC 13.
> >>>>
> >>>> I can further study this bug and try to come up with a good solution in gcc13.
> >>>>
> >>>> Thank you!
> >>>>
> >>>> Qing
> >>>>>
> >>>>> Richard.
> >>>>>
> >>>>>> Qing
>
Qing Zhao April 22, 2022, 3 p.m. UTC | #10
> On Apr 20, 2022, at 5:38 AM, Richard Biener <richard.guenther@gmail.com> wrote:
> 
> On Tue, Apr 19, 2022 at 11:36 PM Qing Zhao <qing.zhao@oracle.com> wrote:
>> 
>> 
>> 
>>> On Apr 14, 2022, at 1:53 AM, Richard Biener <richard.guenther@gmail.com> wrote:
>>> 
>>> On Wed, Apr 13, 2022 at 5:22 PM Qing Zhao <qing.zhao@oracle.com> wrote:
>>>> 
>>>> Hi, Richard,
>>>> 
>>>> Thanks a lot for taking a look at this issue (and Sorry that I haven’t fixed this one yet, I was distracted by other tasks then just forgot this one….)
>>>> 
>>>>> On Apr 13, 2022, at 3:41 AM, Richard Biener <richard.guenther@gmail.com> wrote:
>>>>> 
>>>>> On Tue, Feb 15, 2022 at 5:31 PM Qing Zhao via Gcc-patches
>>>>> <gcc-patches@gcc.gnu.org> wrote:
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>>> On Feb 15, 2022, at 3:58 AM, Jakub Jelinek <jakub@redhat.com> wrote:
>>>>>>> 
>>>>>>> Hi!
>>>>>>> 
>>>>>>> For IBM double double I've added in PR95450 and PR99648 verification that
>>>>>>> when we at the tree/GIMPLE or RTL level interpret target bytes as a REAL_CST
>>>>>>> or CONST_DOUBLE constant, we try to encode it back to target bytes and
>>>>>>> verify it is the same.
>>>>>>> This is because our real.c support isn't able to represent all valid values
>>>>>>> of IBM double double which has variable precision.
>>>>>>> In PR104522, it has been noted that we have similar problem with the
>>>>>>> Intel/Motorola extended XFmode formats, our internal representation isn't
>>>>>>> able to record pseudo denormals, pseudo infinities, pseudo NaNs and unnormal
>>>>>>> values.
>>>>>>> So, the following patch is an attempt to extend that verification to all
>>>>>>> floats.
>>>>>>> Unfortunately, it wasn't that straightforward, because the
>>>>>>> __builtin_clear_padding code exactly for the XFmode long doubles needs to
>>>>>>> discover what bits are padding and does that by interpreting memory of
>>>>>>> all 1s.  That is actually a valid supported value, a qNaN with negative
>>>>>>> sign with all mantissa bits set, but the verification includes also the
>>>>>>> padding bits (exactly what __builtin_clear_padding wants to figure out)
>>>>>>> and so fails the comparison check and so we ICE.
>>>>>>> The patch fixes that case by moving that verification from
>>>>>>> native_interpret_real to its caller, so that clear_padding_type can
>>>>>>> call native_interpret_real and avoid that extra check.
>>>>>>> 
>>>>>>> With this, the only thing that regresses in the testsuite is
>>>>>>> +FAIL: gcc.target/i386/auto-init-4.c scan-assembler-times long\\t-16843010 5
>>>>>>> because it decides to use a pattern that has non-zero bits in the padding
>>>>>>> bits of the long double, so the simplify-rtx.cc change prevents folding
>>>>>>> a SUBREG into a constant.  We emit (the testcase is -O0 but we emit worse
>>>>>>> code at all opt levels) something like:
>>>>>>>     movabsq $-72340172838076674, %rax
>>>>>>>     movabsq $-72340172838076674, %rdx
>>>>>>>     movq    %rax, -48(%rbp)
>>>>>>>     movq    %rdx, -40(%rbp)
>>>>>>>     fldt    -48(%rbp)
>>>>>>>     fstpt   -32(%rbp)
>>>>>>> instead of
>>>>>>>     fldt    .LC2(%rip)
>>>>>>>     fstpt   -32(%rbp)
>>>>>>> ...
>>>>>>> .LC2:
>>>>>>>     .long   -16843010
>>>>>>>     .long   -16843010
>>>>>>>     .long   65278
>>>>>>>     .long   0
>>>>>>> Note, neither of those sequences actually stores the padding bits, fstpt
>>>>>>> simply doesn't touch them.
>>>>>>> For vars with clear_padding_real_needs_padding_p types that are allocated
>>>>>>> to memory at expansion time, I'd say much better would be to do the stores
>>>>>>> using integral modes rather than XFmode, so do that:
>>>>>>>     movabsq $-72340172838076674, %rax
>>>>>>>    movq    %rax, -32(%rbp)
>>>>>>>    movq    %rax, -24(%rbp)
>>>>>>> directly.  That is the only way to ensure the padding bits are initialized
>>>>>>> (or expand __builtin_clear_padding, but then you initialize separately the
>>>>>>> value bits and padding bits).
>>>>>>> 
>>>>>>> Bootstrapped/regtested on x86_64-linux and i686-linux, though as mentioned
>>>>>>> above, the gcc.target/i386/auto-init-4.c case is unresolved.
>>>>>> 
>>>>>> Thanks, I will try to fix this testing case in a later patch.
>>>>> 
>>>>> I've looked at this FAIL now and really wonder whether "pattern init" as
>>>>> implemented makes any sense for non-integral types.
>>>>> We end up with
>>>>> initializing a register (SSA name) with
>>>>> 
>>>>> VIEW_CONVERT_EXPR<long double>(0xfefefefefefefefefefefefefefefefe)
>>>>> 
>>>>> as we go building a TImode constant (we verified we have a TImode SET!)
>>>>> but then
>>>>> 
>>>>>        /* Pun the LHS to make sure its type has constant size
>>>>>           unless it is an SSA name where that's already known.  */
>>>>>        if (TREE_CODE (lhs) != SSA_NAME)
>>>>>          lhs = build1 (VIEW_CONVERT_EXPR, itype, lhs);
>>>>>        else
>>>>>          init = fold_build1 (VIEW_CONVERT_EXPR, TREE_TYPE (lhs), init);
>>>>> ...
>>>>>    expand_assignment (lhs, init, false);
>>>>> 
>>>>> and generally registers do not have any padding.  This weird expansion
>>>>> then causes us to spill the TImode constant and reload the XFmode value,
>>>>> which is definitely not optimal here.
>>>>> 
>>>>> One approach to avoid the worse code generation would be to use mode
>>>>> specific patterns for registers (like using a NaN or a target specific
>>>>> value that
>>>>> can be loaded cheaply),
>>>> 
>>>> You mean that using “mode specific patterns” ONLY for registers?
>>>> Can we use “mode specific patterns” consistently for both registers and memory?
>>> 
>>> The difference is that registers don't have padding while memory
>>> possibly does, also
>>> for aggregates using different patterns is too expensive IMHO.  For
>>> registers the extra
>>> complication with generic patterns is that efficient initialization
>>> (without going through memory)
>>> should be a priority IMHO.
>>> 
>>> And for stack memory I still think that initializing the full
>>> allocated frame (including padding
>>> between variables!) is the best approach.
>>> 
>>>> LLVM use different patterns for different types (Integral, Float, pointer type, etc…) in order to
>>>> Catch bugs easily for different types.
>>>> 
>>>> The beginning versions of this patch tried to do similar as LLVM, but later we gave up this and
>>>> Use the same pattern for all different types.
>>>> 
>>>> If now, we want to use “mode specific patterns” for registers, I am wondering whether it’s
>>>> Better to just use “mode specific patterns” consistently for both registers and memory?
>>>> 
>>>>> another would be to simply fall back to zero
>>>>> initialization
>>>>> when we fail to constant fold the initializer like with
>>>>> 
>>>>> diff --git a/gcc/internal-fn.cc b/gcc/internal-fn.cc
>>>>> index 8b1733e20c4..a4b995f71e4 100644
>>>>> --- a/gcc/internal-fn.cc
>>>>> +++ b/gcc/internal-fn.cc
>>>>> @@ -3125,7 +3125,11 @@ expand_DEFERRED_INIT (internal_fn, gcall *stmt)
>>>>>        if (TREE_CODE (lhs) != SSA_NAME)
>>>>>          lhs = build1 (VIEW_CONVERT_EXPR, itype, lhs);
>>>>>        else
>>>>> -           init = fold_build1 (VIEW_CONVERT_EXPR, TREE_TYPE (lhs), init);
>>>>> +           {
>>>>> +             init = fold_build1 (VIEW_CONVERT_EXPR, TREE_TYPE (lhs), init);
>>>>> +             if (!CONSTANT_CLASS_P (init))
>>>>> +               init = build_zero_cst (TREE_TYPE (lhs));
>>>>> +           }
>>>>>      }
>>>>>     else
>>>>>      /* Use zero-init also for variable-length sizes.  */
>>>>> 
>>>>> note that still does not address the issue noted by Jakub that we do not
>>>>> initialize padding this way - but of course that's because we expand a
>>>>> special assignment from .DEFERRED_INIT and are not initializing
>>>>> the storage allocated for the variable on the stack (at -O0) by RTL
>>>>> expansion.  Ideally .DEFERRED_INIT "expansion" for stack variables
>>>>> would take place there, simply filling the allocated frame with the
>>>>> pattern or zero.  That would probably mean that RTL expansion
>>>>> should scoop up .DEFERRED_INITs for variables it decides not to
>>>>> expand to pseudos and not leave that to stmt expansion.
>>>> 
>>>> Need to study this a little bit more...
>> 
>> 
>> So, Is what you mean in the above a complete rewrite of the current “expand_DEFERRED_INIT”:
>> 
>> Instead of simply using “expand_builtin_memset” for “variables that are not in register” and  “expand_assignment”
>> for “variables that are in register”,  RTL expansion should directly expand this call in a lower level:
>> 
>> i.e,
>> 
>>  tree lhs = gimple_call_lhs (stmt);
>>  …
>> 
>>  rtx target = expand_expr (lhs, NULL_RTX, VOIDmode, EXPAND_WRITE);
>> 
>>  If (MEM_P (target))   // the variable is allocated on stack
>>  {
>>     // filling the allocated frame with pattern or zero.   How to do it??
>>  }
>>  else                     // the variable is in pseudo register.
>>  {
>>    rtx init_rtx = …;
>>    emit_move_insn (target, init_rtx)
>>  }
>> 
>> 
>> Is the above a correct understanding?
> 
> No, I would simply expand the automatic in-memory var .DEFERRED_INIT
> calls to nothing
> but instead process their effect at the time we do
> expand_one_stack_var, which means at
> CFG expansion time scan for the .DEFERRED_INITs and record whether and how to
> initialize the variables.  We can then group the variables accordingly
> and block-initialize
> the stack space also covering padding inbetween variables.

I spent some time reading cfgexpand, especially the “expand_used_vars” part.  I have the following questions:

1. Does “group the variables accordingly” mean: for all the auto variables associated with .DEFERRED_INIT, during the “expand_used_vars”
phase we should first defer the stack allocation and mark them as “DEFERRED_INIT”, and later, during the stack grouping phase, in addition
to the current “Phase 1”, “Phase 2” and “Phase 3” calls to “expand_stack_vars”, add another phase of “expand_stack_vars” for the ones marked
with “DEFERRED_INIT”, group them together, and then “block-initialize” this part of the stack space?

2. It looks like “expand_used_vars” is done BEFORE “expand_gimple_basic_block”.  So, if we insert the pattern-initialization RTL for
block-initializing the stack space, how can we do that in the “expand_used_vars” phase, before all the RTL statements have been emitted?

Qing

> 
> Richard.
> 
>> Thanks.
>> 
>> Qing
>> 
>>>> 
>>>>> 
>>>>> I'm going to simply XFAIL this testcase for GCC 12 and opened
>>>>> PR105259 so we can revisit the above for GCC 13.
>>>> 
>>>> I can further study this bug and try to come up with a good solution in gcc13.
>>>> 
>>>> Thank you!
>>>> 
>>>> Qing
>>>>> 
>>>>> Richard.
>>>>> 
>>>>>> Qing
Qing Zhao April 22, 2022, 3:26 p.m. UTC | #11
> On Apr 21, 2022, at 2:09 AM, Richard Biener <richard.guenther@gmail.com> wrote:
> 
> On Wed, Apr 20, 2022 at 6:02 PM Qing Zhao <qing.zhao@oracle.com> wrote:
>> 
>> 
>> 
>>> On Apr 20, 2022, at 5:38 AM, Richard Biener <richard.guenther@gmail.com> wrote:
>>> 
>>> On Tue, Apr 19, 2022 at 11:36 PM Qing Zhao <qing.zhao@oracle.com> wrote:
>>>> 
>>>> 
>>>> 
>>>>> On Apr 14, 2022, at 1:53 AM, Richard Biener <richard.guenther@gmail.com> wrote:
>>>>> 
>>>>> On Wed, Apr 13, 2022 at 5:22 PM Qing Zhao <qing.zhao@oracle.com> wrote:
>>>>>> 
>>>>>> Hi, Richard,
>>>>>> 
>>>>>> Thanks a lot for taking a look at this issue (and Sorry that I haven’t fixed this one yet, I was distracted by other tasks then just forgot this one….)
>>>>>> 
>>>>>>> On Apr 13, 2022, at 3:41 AM, Richard Biener <richard.guenther@gmail.com> wrote:
>>>>>>> 
>>>>>>> On Tue, Feb 15, 2022 at 5:31 PM Qing Zhao via Gcc-patches
>>>>>>> <gcc-patches@gcc.gnu.org> wrote:
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> On Feb 15, 2022, at 3:58 AM, Jakub Jelinek <jakub@redhat.com> wrote:
>>>>>>>>> 
>>>>>>>>> Hi!
>>>>>>>>> 
>>>>>>>>> For IBM double double I've added in PR95450 and PR99648 verification that
>>>>>>>>> when we at the tree/GIMPLE or RTL level interpret target bytes as a REAL_CST
>>>>>>>>> or CONST_DOUBLE constant, we try to encode it back to target bytes and
>>>>>>>>> verify it is the same.
>>>>>>>>> This is because our real.c support isn't able to represent all valid values
>>>>>>>>> of IBM double double which has variable precision.
>>>>>>>>> In PR104522, it has been noted that we have similar problem with the
>>>>>>>>> Intel/Motorola extended XFmode formats, our internal representation isn't
>>>>>>>>> able to record pseudo denormals, pseudo infinities, pseudo NaNs and unnormal
>>>>>>>>> values.
>>>>>>>>> So, the following patch is an attempt to extend that verification to all
>>>>>>>>> floats.
>>>>>>>>> Unfortunately, it wasn't that straightforward, because the
>>>>>>>>> __builtin_clear_padding code exactly for the XFmode long doubles needs to
>>>>>>>>> discover what bits are padding and does that by interpreting memory of
>>>>>>>>> all 1s.  That is actually a valid supported value, a qNaN with negative
>>>>>>>>> sign with all mantissa bits set, but the verification includes also the
>>>>>>>>> padding bits (exactly what __builtin_clear_padding wants to figure out)
>>>>>>>>> and so fails the comparison check and so we ICE.
>>>>>>>>> The patch fixes that case by moving that verification from
>>>>>>>>> native_interpret_real to its caller, so that clear_padding_type can
>>>>>>>>> call native_interpret_real and avoid that extra check.
>>>>>>>>> 
>>>>>>>>> With this, the only thing that regresses in the testsuite is
>>>>>>>>> +FAIL: gcc.target/i386/auto-init-4.c scan-assembler-times long\\t-16843010 5
>>>>>>>>> because it decides to use a pattern that has non-zero bits in the padding
>>>>>>>>> bits of the long double, so the simplify-rtx.cc change prevents folding
>>>>>>>>> a SUBREG into a constant.  We emit (the testcase is -O0 but we emit worse
>>>>>>>>> code at all opt levels) something like:
>>>>>>>>>    movabsq $-72340172838076674, %rax
>>>>>>>>>    movabsq $-72340172838076674, %rdx
>>>>>>>>>    movq    %rax, -48(%rbp)
>>>>>>>>>    movq    %rdx, -40(%rbp)
>>>>>>>>>    fldt    -48(%rbp)
>>>>>>>>>    fstpt   -32(%rbp)
>>>>>>>>> instead of
>>>>>>>>>    fldt    .LC2(%rip)
>>>>>>>>>    fstpt   -32(%rbp)
>>>>>>>>> ...
>>>>>>>>> .LC2:
>>>>>>>>>    .long   -16843010
>>>>>>>>>    .long   -16843010
>>>>>>>>>    .long   65278
>>>>>>>>>    .long   0
>>>>>>>>> Note, neither of those sequences actually stores the padding bits, fstpt
>>>>>>>>> simply doesn't touch them.
>>>>>>>>> For vars with clear_padding_real_needs_padding_p types that are allocated
>>>>>>>>> to memory at expansion time, I'd say much better would be to do the stores
>>>>>>>>> using integral modes rather than XFmode, so do that:
>>>>>>>>>    movabsq $-72340172838076674, %rax
>>>>>>>>>   movq    %rax, -32(%rbp)
>>>>>>>>>   movq    %rax, -24(%rbp)
>>>>>>>>> directly.  That is the only way to ensure the padding bits are initialized
>>>>>>>>> (or expand __builtin_clear_padding, but then you initialize separately the
>>>>>>>>> value bits and padding bits).
>>>>>>>>> 
>>>>>>>>> Bootstrapped/regtested on x86_64-linux and i686-linux, though as mentioned
>>>>>>>>> above, the gcc.target/i386/auto-init-4.c case is unresolved.
>>>>>>>> 
>>>>>>>> Thanks, I will try to fix this testing case in a later patch.
>>>>>>> 
>>>>>>> I've looked at this FAIL now and really wonder whether "pattern init" as
>>>>>>> implemented makes any sense for non-integral types.
>>>>>>> We end up with
>>>>>>> initializing a register (SSA name) with
>>>>>>> 
>>>>>>> VIEW_CONVERT_EXPR<long double>(0xfefefefefefefefefefefefefefefefe)
>>>>>>> 
>>>>>>> as we go building a TImode constant (we verified we have a TImode SET!)
>>>>>>> but then
>>>>>>> 
>>>>>>>       /* Pun the LHS to make sure its type has constant size
>>>>>>>          unless it is an SSA name where that's already known.  */
>>>>>>>       if (TREE_CODE (lhs) != SSA_NAME)
>>>>>>>         lhs = build1 (VIEW_CONVERT_EXPR, itype, lhs);
>>>>>>>       else
>>>>>>>         init = fold_build1 (VIEW_CONVERT_EXPR, TREE_TYPE (lhs), init);
>>>>>>> ...
>>>>>>>   expand_assignment (lhs, init, false);
>>>>>>> 
>>>>>>> and generally registers do not have any padding.  This weird expansion
>>>>>>> then causes us to spill the TImode constant and reload the XFmode value,
>>>>>>> which is definitely not optimal here.
>>>>>>> 
>>>>>>> One approach to avoid the worse code generation would be to use mode
>>>>>>> specific patterns for registers (like using a NaN or a target specific
>>>>>>> value that
>>>>>>> can be loaded cheaply),
>>>>>> 
>>>>>> You mean that using “mode specific patterns” ONLY for registers?
>>>>>> Can we use “mode specific patterns” consistently for both registers and memory?
>>>>> 
>>>>> The difference is that registers don't have padding while memory
>>>>> possibly does, also
>>>>> for aggregates using different patterns is too expensive IMHO.  For
>>>>> registers the extra
>>>>> complication with generic patterns is that efficient initialization
>>>>> (without going through memory)
>>>>> should be a priority IMHO.
>>>>> 
>>>>> And for stack memory I still think that initializing the full
>>>>> allocated frame (including padding
>>>>> between variables!) is the best approach.
>>>>> 
>>>>>> LLVM use different patterns for different types (Integral, Float, pointer type, etc…) in order to
>>>>>> Catch bugs easily for different types.
>>>>>> 
>>>>>> The beginning versions of this patch tried to do similar as LLVM, but later we gave up this and
>>>>>> Use the same pattern for all different types.
>>>>>> 
>>>>>> If now, we want to use “mode specific patterns” for registers, I am wondering whether it’s
>>>>>> Better to just use “mode specific patterns” consistently for both registers and memory?
>>>>>> 
>>>>>>> another would be to simply fall back to zero
>>>>>>> initialization
>>>>>>> when we fail to constant fold the initializer like with
>>>>>>> 
>>>>>>> diff --git a/gcc/internal-fn.cc b/gcc/internal-fn.cc
>>>>>>> index 8b1733e20c4..a4b995f71e4 100644
>>>>>>> --- a/gcc/internal-fn.cc
>>>>>>> +++ b/gcc/internal-fn.cc
>>>>>>> @@ -3125,7 +3125,11 @@ expand_DEFERRED_INIT (internal_fn, gcall *stmt)
>>>>>>>       if (TREE_CODE (lhs) != SSA_NAME)
>>>>>>>         lhs = build1 (VIEW_CONVERT_EXPR, itype, lhs);
>>>>>>>       else
>>>>>>> -           init = fold_build1 (VIEW_CONVERT_EXPR, TREE_TYPE (lhs), init);
>>>>>>> +           {
>>>>>>> +             init = fold_build1 (VIEW_CONVERT_EXPR, TREE_TYPE (lhs), init);
>>>>>>> +             if (!CONSTANT_CLASS_P (init))
>>>>>>> +               init = build_zero_cst (TREE_TYPE (lhs));
>>>>>>> +           }
>>>>>>>     }
>>>>>>>    else
>>>>>>>     /* Use zero-init also for variable-length sizes.  */
>>>>>>> 
>>>>>>> note that still does not address the issue noted by Jakub that we do not
>>>>>>> initialize padding this way - but of course that's because we expand a
>>>>>>> special assignment from .DEFERRED_INIT and are not initializing
>>>>>>> the storage allocated for the variable on the stack (at -O0) by RTL
>>>>>>> expansion.  Ideally .DEFERRED_INIT "expansion" for stack variables
>>>>>>> would take place there, simply filling the allocated frame with the
>>>>>>> pattern or zero.  That would probably mean that RTL expansion
>>>>>>> should scoop up .DEFERRED_INITs for variables it decides not to
>>>>>>> expand to pseudos and not leave that to stmt expansion.
>>>>>> 
>>>>>> Need to study this a little bit more...
>>>> 
>>>> 
>>>> So, Is what you mean in the above a complete rewrite of the current “expand_DEFERRED_INIT”:
>>>> 
>>>> Instead of simply using “expand_builtin_memset” for “variables that are not in register” and  “expand_assignment”
>>>> for “variables that are in register”,  RTL expansion should directly expand this call in a lower level:
>>>> 
>>>> i.e,
>>>> 
>>>> tree lhs = gimple_call_lhs (stmt);
>>>> …
>>>> 
>>>> rtx target = expand_expr (lhs, NULL_RTX, VOIDmode, EXPAND_WRITE);
>>>> 
>>>> If (MEM_P (target))   // the variable is allocated on stack
>>>> {
>>>>    // filling the allocated frame with pattern or zero.   How to do it??
>>>> }
>>>> else                     // the variable is in pseudo register.
>>>> {
>>>>   rtx init_rtx = …;
>>>>   emit_move_insn (target, init_rtx)
>>>> }
>>>> 
>>>> 
>>>> Is the above a correct understanding?
>>> 
>>> No, I would simply expand the automatic in-memory var .DEFERRED_INIT
>>> calls to nothing
>>> but instead process their effect at the time we do
>>> expand_one_stack_var, which means at
>>> CFG expansion time scan for the .DEFERRED_INITs and record whether and how to
>>> initialize the variables.  We can then group the variables accordingly
>>> and block-initialize
>>> the stack space also covering padding inbetween variables.
>> 
>> So, the above only covers the variables that will be allocated in stack?
> 
> Yes.
> 
>> For the variables that are in pseudo registers, we still need to simply expand the initialization to a move insn?
> 
> Yes, also for stack VLAs.

Why for stack VLAs? 
> 
>> And for the variables in pseudo registers, later after register allocation, some of them will be spilled into stack, too.
>> Specially for long double/complex double variables that might have padding, shall we specially handle them during that time?
> 
> Ideally we'd never spill uninitialized variables but definitely the
> auto-init can cause the need to spill that very
> value.  That means that auto-init of registers should happen after RA?

Should all the register initializations happen after RA, or only the ones with “long double/complex double” types that might have padding?

> One would need to see what
> IRA/LRA do to uninitialized pseudo uses.  Computing may uninit uses
> after RA should be possible and
> hopefully all ISAs allow initializing hard registers without requiring
> a scratch register (heh - I'm almost sure
> there will be exceptions …)
Need more study here….

Qing
> 
> Richard.
> 
>> Qing
>>> 
>>> Richard.
>>> 
>>>> Thanks.
>>>> 
>>>> Qing
>>>> 
>>>>>> 
>>>>>>> 
>>>>>>> I'm going to simply XFAIL this testcase for GCC 12 and opened
>>>>>>> PR105259 so we can revisit the above for GCC 13.
>>>>>> 
>>>>>> I can further study this bug and try to come up with a good solution in gcc13.
>>>>>> 
>>>>>> Thank you!
>>>>>> 
>>>>>> Qing
>>>>>>> 
>>>>>>> Richard.
>>>>>>> 
>>>>>>>> Qing
Richard Biener April 25, 2022, 6:24 a.m. UTC | #12
On Fri, Apr 22, 2022 at 5:01 PM Qing Zhao <qing.zhao@oracle.com> wrote:
>
>
>
> > On Apr 20, 2022, at 5:38 AM, Richard Biener <richard.guenther@gmail.com> wrote:
> >
> > On Tue, Apr 19, 2022 at 11:36 PM Qing Zhao <qing.zhao@oracle.com> wrote:
> >>
> >>
> >>
> >>> On Apr 14, 2022, at 1:53 AM, Richard Biener <richard.guenther@gmail.com> wrote:
> >>>
> >>> On Wed, Apr 13, 2022 at 5:22 PM Qing Zhao <qing.zhao@oracle.com> wrote:
> >>>>
> >>>> Hi, Richard,
> >>>>
> >>>> Thanks a lot for taking a look at this issue (and Sorry that I haven’t fixed this one yet, I was distracted by other tasks then just forgot this one….)
> >>>>
> >>>>> On Apr 13, 2022, at 3:41 AM, Richard Biener <richard.guenther@gmail.com> wrote:
> >>>>>
> >>>>> On Tue, Feb 15, 2022 at 5:31 PM Qing Zhao via Gcc-patches
> >>>>> <gcc-patches@gcc.gnu.org> wrote:
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>> On Feb 15, 2022, at 3:58 AM, Jakub Jelinek <jakub@redhat.com> wrote:
> >>>>>>>
> >>>>>>> Hi!
> >>>>>>>
> >>>>>>> For IBM double double I've added in PR95450 and PR99648 verification that
> >>>>>>> when we at the tree/GIMPLE or RTL level interpret target bytes as a REAL_CST
> >>>>>>> or CONST_DOUBLE constant, we try to encode it back to target bytes and
> >>>>>>> verify it is the same.
> >>>>>>> This is because our real.c support isn't able to represent all valid values
> >>>>>>> of IBM double double which has variable precision.
> >>>>>>> In PR104522, it has been noted that we have similar problem with the
> >>>>>>> Intel/Motorola extended XFmode formats, our internal representation isn't
> >>>>>>> able to record pseudo denormals, pseudo infinities, pseudo NaNs and unnormal
> >>>>>>> values.
> >>>>>>> So, the following patch is an attempt to extend that verification to all
> >>>>>>> floats.
> >>>>>>> Unfortunately, it wasn't that straightforward, because the
> >>>>>>> __builtin_clear_padding code exactly for the XFmode long doubles needs to
> >>>>>>> discover what bits are padding and does that by interpreting memory of
> >>>>>>> all 1s.  That is actually a valid supported value, a qNaN with negative
> >>>>>>> sign with all mantissa bits set, but the verification includes also the
> >>>>>>> padding bits (exactly what __builtin_clear_padding wants to figure out)
> >>>>>>> and so fails the comparison check and so we ICE.
> >>>>>>> The patch fixes that case by moving that verification from
> >>>>>>> native_interpret_real to its caller, so that clear_padding_type can
> >>>>>>> call native_interpret_real and avoid that extra check.
> >>>>>>>
> >>>>>>> With this, the only thing that regresses in the testsuite is
> >>>>>>> +FAIL: gcc.target/i386/auto-init-4.c scan-assembler-times long\\t-16843010 5
> >>>>>>> because it decides to use a pattern that has non-zero bits in the padding
> >>>>>>> bits of the long double, so the simplify-rtx.cc change prevents folding
> >>>>>>> a SUBREG into a constant.  We emit (the testcase is -O0 but we emit worse
> >>>>>>> code at all opt levels) something like:
> >>>>>>>     movabsq $-72340172838076674, %rax
> >>>>>>>     movabsq $-72340172838076674, %rdx
> >>>>>>>     movq    %rax, -48(%rbp)
> >>>>>>>     movq    %rdx, -40(%rbp)
> >>>>>>>     fldt    -48(%rbp)
> >>>>>>>     fstpt   -32(%rbp)
> >>>>>>> instead of
> >>>>>>>     fldt    .LC2(%rip)
> >>>>>>>     fstpt   -32(%rbp)
> >>>>>>> ...
> >>>>>>> .LC2:
> >>>>>>>     .long   -16843010
> >>>>>>>     .long   -16843010
> >>>>>>>     .long   65278
> >>>>>>>     .long   0
> >>>>>>> Note, neither of those sequences actually stores the padding bits, fstpt
> >>>>>>> simply doesn't touch them.
> >>>>>>> For vars with clear_padding_real_needs_padding_p types that are allocated
> >>>>>>> to memory at expansion time, I'd say much better would be to do the stores
> >>>>>>> using integral modes rather than XFmode, so do that:
> >>>>>>>     movabsq $-72340172838076674, %rax
> >>>>>>>    movq    %rax, -32(%rbp)
> >>>>>>>    movq    %rax, -24(%rbp)
> >>>>>>> directly.  That is the only way to ensure the padding bits are initialized
> >>>>>>> (or expand __builtin_clear_padding, but then you initialize separately the
> >>>>>>> value bits and padding bits).
> >>>>>>>
> >>>>>>> Bootstrapped/regtested on x86_64-linux and i686-linux, though as mentioned
> >>>>>>> above, the gcc.target/i386/auto-init-4.c case is unresolved.
> >>>>>>
> >>>>>> Thanks, I will try to fix this testing case in a later patch.
> >>>>>
> >>>>> I've looked at this FAIL now and really wonder whether "pattern init" as
> >>>>> implemented makes any sense for non-integral types.
> >>>>> We end up with
> >>>>> initializing a register (SSA name) with
> >>>>>
> >>>>> VIEW_CONVERT_EXPR<long double>(0xfefefefefefefefefefefefefefefefe)
> >>>>>
> >>>>> as we go building a TImode constant (we verified we have a TImode SET!)
> >>>>> but then
> >>>>>
> >>>>>        /* Pun the LHS to make sure its type has constant size
> >>>>>           unless it is an SSA name where that's already known.  */
> >>>>>        if (TREE_CODE (lhs) != SSA_NAME)
> >>>>>          lhs = build1 (VIEW_CONVERT_EXPR, itype, lhs);
> >>>>>        else
> >>>>>          init = fold_build1 (VIEW_CONVERT_EXPR, TREE_TYPE (lhs), init);
> >>>>> ...
> >>>>>    expand_assignment (lhs, init, false);
> >>>>>
> >>>>> and generally registers do not have any padding.  This weird expansion
> >>>>> then causes us to spill the TImode constant and reload the XFmode value,
> >>>>> which is definitely not optimal here.
> >>>>>
> >>>>> One approach to avoid the worse code generation would be to use mode
> >>>>> specific patterns for registers (like using a NaN or a target specific
> >>>>> value that
> >>>>> can be loaded cheaply),
> >>>>
> >>>> You mean that using “mode specific patterns” ONLY for registers?
> >>>> Can we use “mode specific patterns” consistently for both registers and memory?
> >>>
> >>> The difference is that registers don't have padding while memory
> >>> possibly does, also
> >>> for aggregates using different patterns is too expensive IMHO.  For
> >>> registers the extra
> >>> complication with generic patterns is that efficient initialization
> >>> (without going through memory)
> >>> should be a priority IMHO.
> >>>
> >>> And for stack memory I still think that initializing the full
> >>> allocated frame (including padding
> >>> between variables!) is the best approach.
> >>>
> >>>> LLVM use different patterns for different types (Integral, Float, pointer type, etc…) in order to
> >>>> Catch bugs easily for different types.
> >>>>
> >>>> The beginning versions of this patch tried to do similar as LLVM, but later we gave up this and
> >>>> Use the same pattern for all different types.
> >>>>
> >>>> If now, we want to use “mode specific patterns” for registers, I am wondering whether it’s
> >>>> Better to just use “mode specific patterns” consistently for both registers and memory?
> >>>>
> >>>>> another would be to simply fall back to zero
> >>>>> initialization
> >>>>> when we fail to constant fold the initializer like with
> >>>>>
> >>>>> diff --git a/gcc/internal-fn.cc b/gcc/internal-fn.cc
> >>>>> index 8b1733e20c4..a4b995f71e4 100644
> >>>>> --- a/gcc/internal-fn.cc
> >>>>> +++ b/gcc/internal-fn.cc
> >>>>> @@ -3125,7 +3125,11 @@ expand_DEFERRED_INIT (internal_fn, gcall *stmt)
> >>>>>        if (TREE_CODE (lhs) != SSA_NAME)
> >>>>>          lhs = build1 (VIEW_CONVERT_EXPR, itype, lhs);
> >>>>>        else
> >>>>> -           init = fold_build1 (VIEW_CONVERT_EXPR, TREE_TYPE (lhs), init);
> >>>>> +           {
> >>>>> +             init = fold_build1 (VIEW_CONVERT_EXPR, TREE_TYPE (lhs), init);
> >>>>> +             if (!CONSTANT_CLASS_P (init))
> >>>>> +               init = build_zero_cst (TREE_TYPE (lhs));
> >>>>> +           }
> >>>>>      }
> >>>>>     else
> >>>>>      /* Use zero-init also for variable-length sizes.  */
> >>>>>
> >>>>> note that still does not address the issue noted by Jakub that we do not
> >>>>> initialize padding this way - but of course that's because we expand a
> >>>>> special assignment from .DEFERRED_INIT and are not initializing
> >>>>> the storage allocated for the variable on the stack (at -O0) by RTL
> >>>>> expansion.  Ideally .DEFERRED_INIT "expansion" for stack variables
> >>>>> would take place there, simply filling the allocated frame with the
> >>>>> pattern or zero.  That would probably mean that RTL expansion
> >>>>> should scoop up .DEFERRED_INITs for variables it decides not to
> >>>>> expand to pseudos and not leave that to stmt expansion.
> >>>>
> >>>> Need to study this a little bit more...
> >>
> >>
> >> So, Is what you mean in the above a complete rewrite of the current “expand_DEFERRED_INIT”:
> >>
> >> Instead of simply using “expand_builtin_memset” for “variables that are not in register” and  “expand_assignment”
> >> for “variables that are in register”,  RTL expansion should directly expand this call in a lower level:
> >>
> >> i.e,
> >>
> >>  tree lhs = gimple_call_lhs (stmt);
> >>  …
> >>
> >>  rtx target = expand_expr (lhs, NULL_RTX, VOIDmode, EXPAND_WRITE);
> >>
> >>  If (MEM_P (target))   // the variable is allocated on stack
> >>  {
> >>     // filling the allocated frame with pattern or zero.   How to do it??
> >>  }
> >>  else                     // the variable is in pseudo register.
> >>  {
> >>    rtx init_rtx = …;
> >>    emit_move_insn (target, init_rtx)
> >>  }
> >>
> >>
> >> Is the above a correct understanding?
> >
> > No, I would simply expand the automatic in-memory var .DEFERRED_INIT
> > calls to nothing
> > but instead process their effect at the time we do
> > expand_one_stack_var, which means at
> > CFG expansion time scan for the .DEFERRED_INITs and record whether and how to
> > initialize the variables.  We can then group the variables accordingly
> > and block-initialize
> > the stack space also covering padding inbetween variables.
>
> I spent some time reading cfgexpand, especially the “expand_used_vars” part. I have the following questions:
>
> 1. Does “Group variables accordingly” mean: for all the auto variables that associate with .DEFERRED_INIT, during “expand_used_var”
> Phase, we should: first “defer the stack allocation” and marking them as “DEFERRED_INIT” , and later, during the stack grouping phase, in addition to the current
> “Phase 1”, “phase 2”, “phase 3” “expand_stack_vars”, add another phase in  “expand_stack_vars” for the ones marked with “DEFERRED_INIT”, group them together,
> And then “block-initialize” this part of the stack space?

Yes.
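
In cfgexpand terms that could look roughly like the following
(deferred_init_phase is a made-up predicate, and it assumes some
recording of the marked decls, like the deferred_init_kind map sketched
earlier in the thread; only an illustration):

  /* Hypothetical extra predicate for expand_stack_vars: pick out the
     stack variables that were recorded as needing .DEFERRED_INIT so
     they end up partitioned and grouped together.  */
  static bool
  deferred_init_phase (size_t i)
  {
    return (deferred_init_kind
            && deferred_init_kind->get (stack_vars[i].decl) != NULL);
  }

and then, in expand_used_vars, next to the existing phases:

  expand_stack_vars (deferred_init_phase, &data);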

> 2. Looks like that “expand_used_vars” is done BEFORE “expand_gimple_basic_block”, So, if we insert the pattern-initialization RTLs for
> Block-initializing  the stack space, how can we do it in the “expand_used_vars” phase before all the RTL statements are emitted?

We probably have to emit the actual block initialization at the start
of the function.
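
For the zero-init case a sketch of producing that block initialization
could be (gen_group_zero_init is a made-up name; pattern init would
need a setmem/memset-style expansion instead of clear_storage):

  /* Zero SIZE bytes of already-allocated stack space at address BASE
     (e.g. the address of the first variable of a group) and return the
     insn sequence, to be emitted at the start of the function.  */
  static rtx_insn *
  gen_group_zero_init (rtx base, unsigned HOST_WIDE_INT size)
  {
    start_sequence ();
    rtx mem = gen_rtx_MEM (BLKmode, base);
    clear_storage (mem, GEN_INT (size), BLOCK_OP_NORMAL);
    rtx_insn *seq = get_insns ();
    end_sequence ();
    return seq;
  }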

>
> Qing
>
> >
> > Richard.
> >
> >> Thanks.
> >>
> >> Qing
> >>
> >>>>
> >>>>>
> >>>>> I'm going to simply XFAIL this testcase for GCC 12 and opened
> >>>>> PR105259 so we can revisit the above for GCC 13.
> >>>>
> >>>> I can further study this bug and try to come up with a good solution in gcc13.
> >>>>
> >>>> Thank you!
> >>>>
> >>>> Qing
> >>>>>
> >>>>> Richard.
> >>>>>
> >>>>>> Qing
>
Richard Biener April 25, 2022, 6:27 a.m. UTC | #13
On Fri, Apr 22, 2022 at 5:26 PM Qing Zhao <qing.zhao@oracle.com> wrote:
>
>
>
> > On Apr 21, 2022, at 2:09 AM, Richard Biener <richard.guenther@gmail.com> wrote:
> >
> > On Wed, Apr 20, 2022 at 6:02 PM Qing Zhao <qing.zhao@oracle.com> wrote:
> >>
> >>
> >>
> >>> On Apr 20, 2022, at 5:38 AM, Richard Biener <richard.guenther@gmail.com> wrote:
> >>>
> >>> On Tue, Apr 19, 2022 at 11:36 PM Qing Zhao <qing.zhao@oracle.com> wrote:
> >>>>
> >>>>
> >>>>
> >>>>> On Apr 14, 2022, at 1:53 AM, Richard Biener <richard.guenther@gmail.com> wrote:
> >>>>>
> >>>>> On Wed, Apr 13, 2022 at 5:22 PM Qing Zhao <qing.zhao@oracle.com> wrote:
> >>>>>>
> >>>>>> Hi, Richard,
> >>>>>>
> >>>>>> Thanks a lot for taking a look at this issue (and Sorry that I haven’t fixed this one yet, I was distracted by other tasks then just forgot this one….)
> >>>>>>
> >>>>>>> On Apr 13, 2022, at 3:41 AM, Richard Biener <richard.guenther@gmail.com> wrote:
> >>>>>>>
> >>>>>>> On Tue, Feb 15, 2022 at 5:31 PM Qing Zhao via Gcc-patches
> >>>>>>> <gcc-patches@gcc.gnu.org> wrote:
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>> On Feb 15, 2022, at 3:58 AM, Jakub Jelinek <jakub@redhat.com> wrote:
> >>>>>>>>>
> >>>>>>>>> Hi!
> >>>>>>>>>
> >>>>>>>>> For IBM double double I've added in PR95450 and PR99648 verification that
> >>>>>>>>> when we at the tree/GIMPLE or RTL level interpret target bytes as a REAL_CST
> >>>>>>>>> or CONST_DOUBLE constant, we try to encode it back to target bytes and
> >>>>>>>>> verify it is the same.
> >>>>>>>>> This is because our real.c support isn't able to represent all valid values
> >>>>>>>>> of IBM double double which has variable precision.
> >>>>>>>>> In PR104522, it has been noted that we have similar problem with the
> >>>>>>>>> Intel/Motorola extended XFmode formats, our internal representation isn't
> >>>>>>>>> able to record pseudo denormals, pseudo infinities, pseudo NaNs and unnormal
> >>>>>>>>> values.
> >>>>>>>>> So, the following patch is an attempt to extend that verification to all
> >>>>>>>>> floats.
> >>>>>>>>> Unfortunately, it wasn't that straightforward, because the
> >>>>>>>>> __builtin_clear_padding code exactly for the XFmode long doubles needs to
> >>>>>>>>> discover what bits are padding and does that by interpreting memory of
> >>>>>>>>> all 1s.  That is actually a valid supported value, a qNaN with negative
> >>>>>>>>> sign with all mantissa bits set, but the verification includes also the
> >>>>>>>>> padding bits (exactly what __builtin_clear_padding wants to figure out)
> >>>>>>>>> and so fails the comparison check and so we ICE.
> >>>>>>>>> The patch fixes that case by moving that verification from
> >>>>>>>>> native_interpret_real to its caller, so that clear_padding_type can
> >>>>>>>>> call native_interpret_real and avoid that extra check.
> >>>>>>>>>
> >>>>>>>>> With this, the only thing that regresses in the testsuite is
> >>>>>>>>> +FAIL: gcc.target/i386/auto-init-4.c scan-assembler-times long\\t-16843010 5
> >>>>>>>>> because it decides to use a pattern that has non-zero bits in the padding
> >>>>>>>>> bits of the long double, so the simplify-rtx.cc change prevents folding
> >>>>>>>>> a SUBREG into a constant.  We emit (the testcase is -O0 but we emit worse
> >>>>>>>>> code at all opt levels) something like:
> >>>>>>>>>    movabsq $-72340172838076674, %rax
> >>>>>>>>>    movabsq $-72340172838076674, %rdx
> >>>>>>>>>    movq    %rax, -48(%rbp)
> >>>>>>>>>    movq    %rdx, -40(%rbp)
> >>>>>>>>>    fldt    -48(%rbp)
> >>>>>>>>>    fstpt   -32(%rbp)
> >>>>>>>>> instead of
> >>>>>>>>>    fldt    .LC2(%rip)
> >>>>>>>>>    fstpt   -32(%rbp)
> >>>>>>>>> ...
> >>>>>>>>> .LC2:
> >>>>>>>>>    .long   -16843010
> >>>>>>>>>    .long   -16843010
> >>>>>>>>>    .long   65278
> >>>>>>>>>    .long   0
> >>>>>>>>> Note, neither of those sequences actually stores the padding bits, fstpt
> >>>>>>>>> simply doesn't touch them.
> >>>>>>>>> For vars with clear_padding_real_needs_padding_p types that are allocated
> >>>>>>>>> to memory at expansion time, I'd say much better would be to do the stores
> >>>>>>>>> using integral modes rather than XFmode, so do that:
> >>>>>>>>>    movabsq $-72340172838076674, %rax
> >>>>>>>>>   movq    %rax, -32(%rbp)
> >>>>>>>>>   movq    %rax, -24(%rbp)
> >>>>>>>>> directly.  That is the only way to ensure the padding bits are initialized
> >>>>>>>>> (or expand __builtin_clear_padding, but then you initialize separately the
> >>>>>>>>> value bits and padding bits).
> >>>>>>>>>
> >>>>>>>>> Bootstrapped/regtested on x86_64-linux and i686-linux, though as mentioned
> >>>>>>>>> above, the gcc.target/i386/auto-init-4.c case is unresolved.
> >>>>>>>>
> >>>>>>>> Thanks, I will try to fix this test case in a later patch.
> >>>>>>>
> >>>>>>> I've looked at this FAIL now and really wonder whether "pattern init" as
> >>>>>>> implemented makes any sense for non-integral types.
> >>>>>>> We end up initializing a register (SSA name) with
> >>>>>>>
> >>>>>>> VIEW_CONVERT_EXPR<long double>(0xfefefefefefefefefefefefefefefefe)
> >>>>>>>
> >>>>>>> as we go building a TImode constant (we verified we have a TImode SET!)
> >>>>>>> but then
> >>>>>>>
> >>>>>>>       /* Pun the LHS to make sure its type has constant size
> >>>>>>>          unless it is an SSA name where that's already known.  */
> >>>>>>>       if (TREE_CODE (lhs) != SSA_NAME)
> >>>>>>>         lhs = build1 (VIEW_CONVERT_EXPR, itype, lhs);
> >>>>>>>       else
> >>>>>>>         init = fold_build1 (VIEW_CONVERT_EXPR, TREE_TYPE (lhs), init);
> >>>>>>> ...
> >>>>>>>   expand_assignment (lhs, init, false);
> >>>>>>>
> >>>>>>> and generally registers do not have any padding.  This weird expansion
> >>>>>>> then causes us to spill the TImode constant and reload the XFmode value,
> >>>>>>> which is definitely not optimal here.
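
For concreteness, the kind of source that triggers this would be something
like the sketch below (just a sketch, assuming -ftrivial-auto-var-init=pattern
on x86_64; the function name is made up):

long double
return_uninit (void)
{
  long double ld;   /* no initializer: pattern init supplies the 0xfe byte
                       pattern, which for the 80-bit XFmode value also lands
                       in the bytes that are only padding */
  return ld;
}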
> >>>>>>>
> >>>>>>> One approach to avoid the worse code generation would be to use mode
> >>>>>>> specific patterns for registers (like using a NaN or a target specific
> >>>>>>> value that
> >>>>>>> can be loaded cheaply),
> >>>>>>
> >>>>>> You mean using “mode specific patterns” ONLY for registers?
> >>>>>> Can we use “mode specific patterns” consistently for both registers and memory?
> >>>>>
> >>>>> The difference is that registers don't have padding while memory
> >>>>> possibly does; also, for aggregates, using different patterns is too
> >>>>> expensive IMHO.  For registers the extra complication with generic
> >>>>> patterns is that efficient initialization (without going through
> >>>>> memory) should be a priority IMHO.
> >>>>>
> >>>>> And for stack memory I still think that initializing the full
> >>>>> allocated frame (including padding
> >>>>> between variables!) is the best approach.
> >>>>>
> >>>>>> LLVM uses different patterns for different types (integral, floating point, pointer, etc…) in order to
> >>>>>> catch bugs for the different types more easily.
> >>>>>>
> >>>>>> The early versions of this patch tried to do something similar to LLVM, but we later gave that up and
> >>>>>> used the same pattern for all types.
> >>>>>>
> >>>>>> If we now want to use “mode specific patterns” for registers, I am wondering whether it’s
> >>>>>> better to just use “mode specific patterns” consistently for both registers and memory?
> >>>>>>
> >>>>>>> another would be to simply fall back to zero
> >>>>>>> initialization
> >>>>>>> when we fail to constant fold the initializer like with
> >>>>>>>
> >>>>>>> diff --git a/gcc/internal-fn.cc b/gcc/internal-fn.cc
> >>>>>>> index 8b1733e20c4..a4b995f71e4 100644
> >>>>>>> --- a/gcc/internal-fn.cc
> >>>>>>> +++ b/gcc/internal-fn.cc
> >>>>>>> @@ -3125,7 +3125,11 @@ expand_DEFERRED_INIT (internal_fn, gcall *stmt)
> >>>>>>>       if (TREE_CODE (lhs) != SSA_NAME)
> >>>>>>>         lhs = build1 (VIEW_CONVERT_EXPR, itype, lhs);
> >>>>>>>       else
> >>>>>>> -           init = fold_build1 (VIEW_CONVERT_EXPR, TREE_TYPE (lhs), init);
> >>>>>>> +           {
> >>>>>>> +             init = fold_build1 (VIEW_CONVERT_EXPR, TREE_TYPE (lhs), init);
> >>>>>>> +             if (!CONSTANT_CLASS_P (init))
> >>>>>>> +               init = build_zero_cst (TREE_TYPE (lhs));
> >>>>>>> +           }
> >>>>>>>     }
> >>>>>>>    else
> >>>>>>>     /* Use zero-init also for variable-length sizes.  */
> >>>>>>>
> >>>>>>> note that this still does not address the issue noted by Jakub that we do not
> >>>>>>> initialize padding this way - but of course that's because we expand a
> >>>>>>> special assignment from .DEFERRED_INIT and are not initializing
> >>>>>>> the storage allocated for the variable on the stack (at -O0) by RTL
> >>>>>>> expansion.  Ideally .DEFERRED_INIT "expansion" for stack variables
> >>>>>>> would take place there, simply filling the allocated frame with the
> >>>>>>> pattern or zero.  That would probably mean that RTL expansion
> >>>>>>> should scoop up .DEFERRED_INITs for variables it decides not to
> >>>>>>> expand to pseudos and not leave that to stmt expansion.
> >>>>>>
> >>>>>> Need to study this a little bit more...
> >>>>
> >>>>
> >>>> So, is what you mean above a complete rewrite of the current “expand_DEFERRED_INIT”:
> >>>>
> >>>> Instead of simply using “expand_builtin_memset” for “variables that are not in a register” and “expand_assignment”
> >>>> for “variables that are in a register”, RTL expansion should directly expand this call at a lower level:
> >>>>
> >>>> i.e.,
> >>>>
> >>>> tree lhs = gimple_call_lhs (stmt);
> >>>> …
> >>>>
> >>>> rtx target = expand_expr (lhs, NULL_RTX, VOIDmode, EXPAND_WRITE);
> >>>>
> >>>> if (MEM_P (target))   // the variable is allocated on the stack
> >>>> {
> >>>>    // filling the allocated frame with pattern or zero.   How to do it??
> >>>> }
> >>>> else                     // the variable is in a pseudo register.
> >>>> {
> >>>>   rtx init_rtx = …;
> >>>>   emit_move_insn (target, init_rtx)
> >>>> }
> >>>>
> >>>>
> >>>> Is the above a correct understanding?
> >>>
> >>> No, I would simply expand the automatic in-memory var .DEFERRED_INIT
> >>> calls to nothing, but instead process their effect at the time we do
> >>> expand_one_stack_var, which means scanning for the .DEFERRED_INITs at
> >>> CFG expansion time and recording whether and how to initialize the
> >>> variables.  We can then group the variables accordingly and
> >>> block-initialize the stack space, also covering padding between
> >>> variables.
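
As a rough illustration of what block-initializing would buy (just a sketch;
'consume' is a made-up external function and x86-64 layout is assumed):
with two locals like the ones below, expanding each .DEFERRED_INIT on its
own leaves the XFmode padding inside 'ld' and any alignment gap between the
two slots untouched, while a single block fill of the frame region covers
all of it.

extern void consume (char *, long double *);

void
two_locals (void)
{
  char c;          /* 1 byte, typically followed by alignment padding */
  long double ld;  /* 16 bytes on x86-64; bytes 10..15 are padding */
  consume (&c, &ld);
}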
> >>
> >> So, the above only covers the variables that will be allocated on the stack?
> >
> > Yes.
> >
> >> For the variables that are in pseudo registers, we still need to simply expand the initialization to a move insn?
> >
> > Yes, also for stack VLAs.
>
> Why for stack VLAs?

VLAs are allocated with alloca, which means the lifetime of a VLA might
not be that of the whole function, so we have to do the initialization
when the variable becomes live.  Incidentally that also applies to
variables we would be able to coalesce to the same stack slot, so we
have to double-check that still works when attempting to
block-initialize.
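
A small example of the lifetime issue (just a sketch; 'use' is a made-up
external function):

extern void use (char *, int);

void
vla_in_block (int n, int cond)
{
  if (cond)
    {
      char buf[n];   /* VLA: its storage only exists once this block is
                        entered, so one fill of the frame at function
                        entry could not initialize it.  */
      use (buf, n);
    }
}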

> >
> >> And for the variables in pseudo registers, later, after register allocation, some of them will be spilled onto the stack, too.
> >> Especially for long double/complex double variables that might have padding, shall we handle them specially at that point?
> >
> > Ideally we'd never spill uninitialized variables, but the auto-init
> > itself can definitely create the need to spill that very value.  That
> > means that auto-init of registers should happen after RA?
>
> Should all the initialization of registers happen after RA?  Or only the ones with “long double/complex double” types that might have padding?

Not sure, it needs some thought.  Making sure all pseudos are
initialized after RTL expansion would be another possibility.

> > One would need to see what IRA/LRA do to uninitialized pseudo uses.
> > Computing maybe-uninitialized uses after RA should be possible, and
> > hopefully all ISAs allow initializing hard registers without requiring
> > a scratch register (heh - I'm almost sure there will be exceptions …)
> Need more study here….

Definitely.

> Qing
> >
> > Richard.
> >
> >> Qing
> >>>
> >>> Richard.
> >>>
> >>>> Thanks.
> >>>>
> >>>> Qing
> >>>>
> >>>>>>
> >>>>>>>
> >>>>>>> I'm going to simply XFAIL this testcase for GCC 12 and have opened
> >>>>>>> PR105259 so we can revisit the above for GCC 13.
> >>>>>>
> >>>>>> I can further study this bug and try to come up with a good solution in GCC 13.
> >>>>>>
> >>>>>> Thank you!
> >>>>>>
> >>>>>> Qing
> >>>>>>>
> >>>>>>> Richard.
> >>>>>>>
> >>>>>>>> Qing
>

Patch

--- gcc/fold-const.h.jj	2022-02-07 21:26:50.717616208 +0100
+++ gcc/fold-const.h	2022-02-15 01:16:14.509617954 +0100
@@ -36,6 +36,7 @@  extern int native_encode_expr (const_tre
 extern int native_encode_initializer (tree, unsigned char *, int,
 				      int off = -1, unsigned char * = nullptr);
 extern tree native_interpret_expr (tree, const unsigned char *, int);
+extern tree native_interpret_real (tree, const unsigned char *, int);
 extern bool can_native_interpret_type_p (tree);
 extern tree native_interpret_aggregate (tree, const unsigned char *, int, int);
 extern tree find_bitfield_repr_type (int, int);
--- gcc/fold-const.cc.jj	2022-02-09 22:15:31.805466094 +0100
+++ gcc/fold-const.cc	2022-02-15 01:36:11.988496438 +0100
@@ -8643,7 +8643,7 @@  native_interpret_fixed (tree type, const
    the buffer PTR of length LEN as a REAL_CST of type TYPE.
    If the buffer cannot be interpreted, return NULL_TREE.  */
 
-static tree
+tree
 native_interpret_real (tree type, const unsigned char *ptr, int len)
 {
   scalar_float_mode mode = SCALAR_FLOAT_TYPE_MODE (type);
@@ -8694,19 +8694,7 @@  native_interpret_real (tree type, const
     }
 
   real_from_target (&r, tmp, mode);
-  tree ret = build_real (type, r);
-  if (MODE_COMPOSITE_P (mode))
-    {
-      /* For floating point values in composite modes, punt if this folding
-	 doesn't preserve bit representation.  As the mode doesn't have fixed
-	 precision while GCC pretends it does, there could be valid values that
-	 GCC can't really represent accurately.  See PR95450.  */
-      unsigned char buf[24];
-      if (native_encode_expr (ret, buf, total_bytes, 0) != total_bytes
-	  || memcmp (ptr, buf, total_bytes) != 0)
-	ret = NULL_TREE;
-    }
-  return ret;
+  return build_real (type, r);
 }
 
 
@@ -8824,7 +8812,23 @@  native_interpret_expr (tree type, const
       return native_interpret_int (type, ptr, len);
 
     case REAL_TYPE:
-      return native_interpret_real (type, ptr, len);
+      if (tree ret = native_interpret_real (type, ptr, len))
+	{
+	  /* For floating point values in composite modes, punt if this
+	     folding doesn't preserve bit representation.  As the mode doesn't
+	     have fixed precision while GCC pretends it does, there could be
+	     valid values that GCC can't really represent accurately.
+	     See PR95450.  Even for other modes, e.g. x86 XFmode can have some
+	     bit combinations which GCC doesn't preserve.  */
+	  unsigned char buf[24];
+	  scalar_float_mode mode = SCALAR_FLOAT_TYPE_MODE (type);
+	  int total_bytes = GET_MODE_SIZE (mode);
+	  if (native_encode_expr (ret, buf, total_bytes, 0) != total_bytes
+	      || memcmp (ptr, buf, total_bytes) != 0)
+	    return NULL_TREE;
+	  return ret;
+	}
+      return NULL_TREE;
 
     case FIXED_POINT_TYPE:
       return native_interpret_fixed (type, ptr, len);
--- gcc/gimple-fold.cc.jj	2022-02-11 21:59:57.177698154 +0100
+++ gcc/gimple-fold.cc	2022-02-15 01:21:08.706649224 +0100
@@ -4807,10 +4807,10 @@  clear_padding_type (clear_padding_struct
 	clear_padding_flush (buf, false);
       if (clear_padding_real_needs_padding_p (type))
 	{
-	  /* Use native_interpret_expr + native_encode_expr to figure out
+	  /* Use native_interpret_real + native_encode_expr to figure out
 	     which bits are padding.  */
 	  memset (buf->buf + buf->size, ~0, sz);
-	  tree cst = native_interpret_expr (type, buf->buf + buf->size, sz);
+	  tree cst = native_interpret_real (type, buf->buf + buf->size, sz);
 	  gcc_assert (cst && TREE_CODE (cst) == REAL_CST);
 	  int len = native_encode_expr (cst, buf->buf + buf->size, sz);
 	  gcc_assert (len > 0 && (size_t) len == (size_t) sz);
--- gcc/simplify-rtx.cc.jj	2022-01-20 21:24:40.611955521 +0100
+++ gcc/simplify-rtx.cc	2022-02-14 22:14:29.258718212 +0100
@@ -7302,7 +7302,7 @@  simplify_immed_subreg (fixed_size_mode o
   else if (!native_encode_rtx (innermode, x, buffer, first_byte, inner_bytes))
     return NULL_RTX;
   rtx ret = native_decode_rtx (outermode, buffer, 0);
-  if (ret && MODE_COMPOSITE_P (outermode))
+  if (ret && FLOAT_MODE_P (outermode))
     {
       auto_vec<target_unit, 128> buffer2 (buffer_bytes);
       if (!native_encode_rtx (outermode, ret, buffer2, 0, buffer_bytes))
--- gcc/testsuite/gcc.dg/pr104522.c.jj	2022-02-14 22:14:29.258718212 +0100
+++ gcc/testsuite/gcc.dg/pr104522.c	2022-02-14 22:14:29.258718212 +0100
@@ -0,0 +1,14 @@ 
+/* PR middle-end/104522 */
+/* { dg-do compile } */
+/* { dg-options "-O -fcompare-debug -dP" } */
+
+typedef short __attribute__((__vector_size__(16))) V;
+long double x;
+
+void
+foo (void)
+{
+  V t = { 512, 0, 0, 0, 16384 };
+  long double u = *(long double *) &t;
+  x /= u;
+}
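
For reference, a sketch of why the bytes above are interesting (x86-64
little-endian layout assumed): the shorts { 512, 0, 0, 0, 16384 } give a
significand of 0x0000000000000200 with the explicit integer bit clear and a
biased exponent of 0x4000, i.e. an "unnormal" XFmode encoding that real.cc
cannot round-trip, which is why the SUBREG fold has to be punted on.  An
equivalent plain-C construction (illustration only, the function name is
made up) would be:

#include <string.h>

long double
unnormal_bits (void)
{
  /* Same bytes as the vector in the testcase; long double is assumed to
     be the 16-byte x86-64 XFmode type.  */
  unsigned short t[8] = { 512, 0, 0, 0, 16384, 0, 0, 0 };
  long double u;
  memcpy (&u, t, sizeof u);
  return u;
}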