diff mbox

PATCH: PR middle-end/45678: [4.4/4.5/4.6 Regression] crash on vector code with -m32 -msse

Message ID AANLkTikVRyKD9kRbF_unro1qHUOywvCx-exG4bYd-ESC@mail.gmail.com
State New
Headers show

Commit Message

H.J. Lu Sept. 17, 2010, 9:31 p.m. UTC
On Fri, Sep 17, 2010 at 1:03 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
> On Fri, Sep 17, 2010 at 12:18 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
>> On Fri, Sep 17, 2010 at 11:32 AM, H.J. Lu <hjl.tools@gmail.com> wrote:
>>> On Fri, Sep 17, 2010 at 11:15 AM, Jakub Jelinek <jakub@redhat.com> wrote:
>>>> On Fri, Sep 17, 2010 at 10:46:00AM -0700, Richard Henderson wrote:
>>>>> On 09/17/2010 09:58 AM, H.J. Lu wrote:
>>>>> >     PR middle-end/45678
>>>>> >     * cfgexpand.c (update_stack_alignment): New.
>>>>> >     (get_decl_align_unit): Use it.
>>>>> >     (expand_one_stack_var_at): Call update_stack_alignment.
>>>>>
>>>>> Ok.
>>>>
>>>> I don't deny this fixes the problem and probably is a good idea in any case,
>>>> the question I have is just if this won't pessimize i?86 code too much,
>>>> especially with non-default options where incoming stack boundary is smaller
>>>> than PREFERRED_STACK_BOUNDARY.
>>>> The thing is that this expand_one_stack_var_at is just an optimization,
>>>> but we don't know if anything will actually make use of the increased
>>>> alignment.  If there will be just variables that need 32-bit alignment
>>>> and say incoming stack alignment is also 32-bit, but preferred stack
>>>> boundary is 128-bit, as soon as some variable will be placed in slot 16
>>>> bytes from the base, we'll force realignment even when nothing actually
>>>> needs it.
>>>>
>>>> So, I wonder if
>>>>      max_align = MAX (crtl->max_used_stack_slot_alignment,
>>>>                       PREFERRED_STACK_BOUNDARY);
>>>> shouldn't be replaced by something like:
>>>> 1)
>>>>      max_align = MAX (crtl->max_used_stack_slot_alignment,
>>>>                       SUPPORTS_STACK_ALIGNMENT
>>>>                       ? STACK_BOUNDARY : PREFERRED_STACK_BOUNDARY);
>>>> or:
>>>> 2)
>>>>      max_align = MAX (crtl->max_used_stack_slot_alignment,
>>>>                       SUPPORTS_STACK_ALIGNMENT
>>>>                       ? INCOMING_STACK_BOUNDARY : PREFERRED_STACK_BOUNDARY);
>>>> or:
>>>> 3)
>>>>      max_align = SUPPORTS_STACK_ALIGNMENT
>>>>                  ? MAX (crtl->max_used_stack_slot_alignment, STACK_BOUNDARY)
>>>>                  : PREFERRED_STACK_BOUNDARY;
>>>> or:
>>>> 4)
>>>>      max_align = SUPPORTS_STACK_ALIGNMENT
>>>>                  ? MAX (crtl->max_used_stack_slot_alignment,
>>>>                         INCOMING_STACK_BOUNDARY)
>>>>                  : PREFERRED_STACK_BOUNDARY;
>>>> For 2) or 4) we'd need to do
>>>>  if (targetm.calls.update_stack_boundary)
>>>>    targetm.calls.update_stack_boundary ();
>>>> earlier (before all the expand_one_stack_var_at calls).
>>>> 1) and 3) are conservative choices for i?86, seems without H.J's patch gcc
>>>> would often happily use just 32-bit alignment of the stack base and even if
>>>> incoming stack boundary was 128-bit, it would tell other routines they don't
>>>> need to bother with that alignment, but e.g. on x86-64 it would still assume
>>>> max_align of 8 bytes initially, even when e.g. x86-64 doesn't allow setting
>>>> PREFERRED_STACK_BOUNDARY below 128 bits.  I think usually it doesn't buy us
>>>> too much to align below INCOMING_STACK_BOUNDARY.  So perhaps 2)/4), where
>>>> needlessly bumping alignment in expand_one_stack_var_at too much would be
>>>> far more costly (would cause stack realignment).
>>>> The 1)/2) vs. 3)/4) difference is for !SUPPORTS_STACK_ALIGNMENT targets,
>>>> I wonder as they are never realigning the stack whether we should ever
>>>> use there higher alignment.
>>>>
>>>
>>> I think it is related to
>>>
>>> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44542
>>>
>>> One of my attempts may work.
>>>
>>> FWIW, I am fine with any solution as long as when we do
>>>
>>> DECL_ALIGN (decl) = align;
>>>
>>> we guarantee that it will be aligned at that boundary at run-time.
>>
>> I don't think we should use offset as alignment unless the variable
>> should be aligned at offset bytes.  My patch
>>
>> http://gcc.gnu.org/bugzilla/attachment.cgi?id=20922&action=view
>>
>> may align the local variable at boundary specified for the variable.
>>
>>
>>
>> --
>> H.J.
>>
>
> Using stack offset for local variable alignment is a bad idea. My patch
> caused:
>
> FAIL: gcc.target/i386/incoming-9.c scan-assembler-not andl[\\t
> ]*\\$-16,[\\t ]*%esp

Here is a patch to align local variable with specified alignment.
There is no need to call update_stack_alignment in
expand_one_stack_var_at since we don't use its offset
to increase is alignment.   OK for trunk if there are regressions?

Thanks.

Comments

H.J. Lu Sept. 17, 2010, 9:34 p.m. UTC | #1
On Fri, Sep 17, 2010 at 2:31 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
> On Fri, Sep 17, 2010 at 1:03 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
>> On Fri, Sep 17, 2010 at 12:18 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
>>> On Fri, Sep 17, 2010 at 11:32 AM, H.J. Lu <hjl.tools@gmail.com> wrote:
>>>> On Fri, Sep 17, 2010 at 11:15 AM, Jakub Jelinek <jakub@redhat.com> wrote:
>>>>> On Fri, Sep 17, 2010 at 10:46:00AM -0700, Richard Henderson wrote:
>>>>>> On 09/17/2010 09:58 AM, H.J. Lu wrote:
>>>>>> >     PR middle-end/45678
>>>>>> >     * cfgexpand.c (update_stack_alignment): New.
>>>>>> >     (get_decl_align_unit): Use it.
>>>>>> >     (expand_one_stack_var_at): Call update_stack_alignment.
>>>>>>
>>>>>> Ok.
>>>>>
>>>>> I don't deny this fixes the problem and probably is a good idea in any case,
>>>>> the question I have is just if this won't pessimize i?86 code too much,
>>>>> especially with non-default options where incoming stack boundary is smaller
>>>>> than PREFERRED_STACK_BOUNDARY.
>>>>> The thing is that this expand_one_stack_var_at is just an optimization,
>>>>> but we don't know if anything will actually make use of the increased
>>>>> alignment.  If there will be just variables that need 32-bit alignment
>>>>> and say incoming stack alignment is also 32-bit, but preferred stack
>>>>> boundary is 128-bit, as soon as some variable will be placed in slot 16
>>>>> bytes from the base, we'll force realignment even when nothing actually
>>>>> needs it.
>>>>>
>>>>> So, I wonder if
>>>>>      max_align = MAX (crtl->max_used_stack_slot_alignment,
>>>>>                       PREFERRED_STACK_BOUNDARY);
>>>>> shouldn't be replaced by something like:
>>>>> 1)
>>>>>      max_align = MAX (crtl->max_used_stack_slot_alignment,
>>>>>                       SUPPORTS_STACK_ALIGNMENT
>>>>>                       ? STACK_BOUNDARY : PREFERRED_STACK_BOUNDARY);
>>>>> or:
>>>>> 2)
>>>>>      max_align = MAX (crtl->max_used_stack_slot_alignment,
>>>>>                       SUPPORTS_STACK_ALIGNMENT
>>>>>                       ? INCOMING_STACK_BOUNDARY : PREFERRED_STACK_BOUNDARY);
>>>>> or:
>>>>> 3)
>>>>>      max_align = SUPPORTS_STACK_ALIGNMENT
>>>>>                  ? MAX (crtl->max_used_stack_slot_alignment, STACK_BOUNDARY)
>>>>>                  : PREFERRED_STACK_BOUNDARY;
>>>>> or:
>>>>> 4)
>>>>>      max_align = SUPPORTS_STACK_ALIGNMENT
>>>>>                  ? MAX (crtl->max_used_stack_slot_alignment,
>>>>>                         INCOMING_STACK_BOUNDARY)
>>>>>                  : PREFERRED_STACK_BOUNDARY;
>>>>> For 2) or 4) we'd need to do
>>>>>  if (targetm.calls.update_stack_boundary)
>>>>>    targetm.calls.update_stack_boundary ();
>>>>> earlier (before all the expand_one_stack_var_at calls).
>>>>> 1) and 3) are conservative choices for i?86, seems without H.J's patch gcc
>>>>> would often happily use just 32-bit alignment of the stack base and even if
>>>>> incoming stack boundary was 128-bit, it would tell other routines they don't
>>>>> need to bother with that alignment, but e.g. on x86-64 it would still assume
>>>>> max_align of 8 bytes initially, even when e.g. x86-64 doesn't allow setting
>>>>> PREFERRED_STACK_BOUNDARY below 128 bits.  I think usually it doesn't buy us
>>>>> too much to align below INCOMING_STACK_BOUNDARY.  So perhaps 2)/4), where
>>>>> needlessly bumping alignment in expand_one_stack_var_at too much would be
>>>>> far more costly (would cause stack realignment).
>>>>> The 1)/2) vs. 3)/4) difference is for !SUPPORTS_STACK_ALIGNMENT targets,
>>>>> I wonder as they are never realigning the stack whether we should ever
>>>>> use there higher alignment.
>>>>>
>>>>
>>>> I think it is related to
>>>>
>>>> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44542
>>>>
>>>> One of my attempts may work.
>>>>
>>>> FWIW, I am fine with any solution as long as when we do
>>>>
>>>> DECL_ALIGN (decl) = align;
>>>>
>>>> we guarantee that it will be aligned at that boundary at run-time.
>>>
>>> I don't think we should use offset as alignment unless the variable
>>> should be aligned at offset bytes.  My patch
>>>
>>> http://gcc.gnu.org/bugzilla/attachment.cgi?id=20922&action=view
>>>
>>> may align the local variable at boundary specified for the variable.
>>>
>>>
>>>
>>> --
>>> H.J.
>>>
>>
>> Using stack offset for local variable alignment is a bad idea. My patch
>> caused:
>>
>> FAIL: gcc.target/i386/incoming-9.c scan-assembler-not andl[\\t
>> ]*\\$-16,[\\t ]*%esp
>
> Here is a patch to align local variable with specified alignment.
> There is no need to call update_stack_alignment in
> expand_one_stack_var_at since we don't use its offset
> to increase is alignment.   OK for trunk if there are regressions?

I meant "OK for trunk if there are NO regressions?"

H.J.
--
> Thanks.
>
>
> --
> H.J.
> ---gcc/
>
> 2010-09-17  H.J. Lu  <hongjiu.lu@intel.com>
>
>        PR target/44542
>        PR middle-end/45678
>        * cfgexpand.c (expand_one_stack_var_at): Take an alignment
>        parameter and use it for alignment.  Don't call
>        update_stack_alignment.
>        (expand_stack_vars): Call expand_one_stack_var_at with specified
>        alignment.
>        (expand_one_stack_var): Likewise.
>
> gcc/testsuite/
>
> 2010-09-17  H.J. Lu  <hongjiu.lu@intel.com>
>
>        PR middle-end/45678
>        * gcc.dg/torture/pr45678-3.c: New.
> -
>
Jakub Jelinek Sept. 17, 2010, 9:40 p.m. UTC | #2
On Fri, Sep 17, 2010 at 02:31:35PM -0700, H.J. Lu wrote:
> > Using stack offset for local variable alignment is a bad idea. My patch
> > caused:

It isn't a bad idea, as long as the base of the stack slots is going to be
as aligned as the code expects.  Then it is a nice optimization, which you
are killing in your patch.  All that is IMHO needed is to be a little bit
more conservative in the estimations.  Haven't tested any of the
possibilities I've listed, but I bet all of them would fix
> >
> > FAIL: gcc.target/i386/incoming-9.c scan-assembler-not andl[\\t
> > ]*\\$-16,[\\t ]*%esp

this.

	Jakub
H.J. Lu Sept. 17, 2010, 11:55 p.m. UTC | #3
On Fri, Sep 17, 2010 at 2:40 PM, Jakub Jelinek <jakub@redhat.com> wrote:
> On Fri, Sep 17, 2010 at 02:31:35PM -0700, H.J. Lu wrote:
>> > Using stack offset for local variable alignment is a bad idea. My patch
>> > caused:
>
> It isn't a bad idea, as long as the base of the stack slots is going to be
> as aligned as the code expects.  Then it is a nice optimization, which you
> are killing in your patch.  All that is IMHO needed is to be a little bit
> more conservative in the estimations.  Haven't tested any of the
> possibilities I've listed, but I bet all of them would fix
>

Personally,  I think it is a bad way to optimize.  If a local variable
has better performance aligned at 8 byte, we should align it
at 8byte explicitly. Align it at 4 byte and hope it will be aligned
at 8byte at run-time isn't a good optimization to me.
diff mbox

Patch

diff --git a/gcc/cfgexpand.c b/gcc/cfgexpand.c
index 1e67e77..684032c 100644
--- a/gcc/cfgexpand.c
+++ b/gcc/cfgexpand.c
@@ -715,13 +715,14 @@  dump_stack_var_partition (void)
     }
 }
 
-/* Assign rtl to DECL at frame offset OFFSET.  */
+/* Assign rtl to DECL at frame offset OFFSET with alignment ALIGN.  */
 
 static void
-expand_one_stack_var_at (tree decl, HOST_WIDE_INT offset)
+expand_one_stack_var_at (tree decl, HOST_WIDE_INT offset,
+			 unsigned HOST_WIDE_INT align)
 {
   /* Alignment is unsigned.   */
-  unsigned HOST_WIDE_INT align, max_align;
+  unsigned HOST_WIDE_INT max_align;
   rtx x;
 
   /* If this fails, we've overflowed the stack frame.  Error nicely?  */
@@ -735,15 +736,11 @@  expand_one_stack_var_at (tree decl, HOST_WIDE_INT offset)
       /* Set alignment we actually gave this decl if it isn't an SSA name.
          If it is we generate stack slots only accidentally so it isn't as
 	 important, we'll simply use the alignment that is already set.  */
-      offset -= frame_phase;
-      align = offset & -offset;
-      align *= BITS_PER_UNIT;
       max_align = MAX (crtl->max_used_stack_slot_alignment,
 		       PREFERRED_STACK_BOUNDARY);
       if (align == 0 || align > max_align)
 	align = max_align;
 
-      update_stack_alignment (align);
       DECL_ALIGN (decl) = align;
       DECL_USER_ALIGN (decl) = 0;
     }
@@ -792,7 +789,8 @@  expand_stack_vars (bool (*pred) (tree))
 	{
 	  gcc_assert (stack_vars[j].offset <= stack_vars[i].size);
 	  expand_one_stack_var_at (stack_vars[j].decl,
-				   stack_vars[j].offset + offset);
+				   stack_vars[j].offset + offset,
+				   stack_vars[j].alignb * BITS_PER_UNIT);
 	}
     }
 }
@@ -825,13 +823,15 @@  account_stack_vars (void)
 static void
 expand_one_stack_var (tree var)
 {
-  HOST_WIDE_INT size, offset, align;
+  HOST_WIDE_INT size, offset;
+  /* Alignment is unsigned.   */
+  unsigned HOST_WIDE_INT align;
 
   size = tree_low_cst (DECL_SIZE_UNIT (SSAVAR (var)), 1);
   align = get_decl_align_unit (SSAVAR (var));
   offset = alloc_stack_frame_space (size, align);
 
-  expand_one_stack_var_at (var, offset);
+  expand_one_stack_var_at (var, offset, align * BITS_PER_UNIT);
 }
 
 /* A subroutine of expand_one_var.  Called to assign rtl to a VAR_DECL
--- /dev/null	2010-09-09 09:16:30.485584932 -0700
+++ gcc/gcc/testsuite/gcc.dg/torture/pr45678-3.c	2010-09-17 14:16:34.655001231 -0700
@@ -0,0 +1,19 @@ 
+typedef float v4sf __attribute__ ((vector_size (16)));
+
+extern void abort (void);
+
+int main(void)
+{
+  float m4[4];
+  v4sf m3;
+  float m2[4] = {4, 3, 2, 1};
+  v4sf one2 = {5, 15, 25, 35};
+
+  __builtin_memcpy (&m3, &m2, sizeof(m2));
+  m3 = m3 * one2;
+  __builtin_memcpy (&m4, &m3, sizeof(m4));
+  if (m4[0] != 20)
+    abort ();
+
+  return 0;
+}