diff mbox

[i386,5/8,AVX-512] Extend vectorizer hooks.

Message ID CAFULd4bqaCZcJZmqZ9Cj=5vUJzofKJWg2hDxCm=-2g6yte66zQ@mail.gmail.com
State New

Commit Message

Uros Bizjak Jan. 3, 2014, 11:02 a.m. UTC
On Thu, Jan 2, 2014 at 11:18 PM, Eric Botcazou <ebotcazou@adacore.com> wrote:
>> Note that it has unexpected side-effects: previously, in 32-bit mode,
>> 256-bit aggregate objects would have been given 256-bit alignment; now,
>> they will fall back to default alignment, for example 32-bit only.
>
> In case this wasn't clear enough, just compile in 32-bit mode:
>
> int a[8] = { 1, 2, 3, 4, 5, 6, 7, 8};

It looks to me that we don't need to adjust anything with max_align.
Using the following patch:

--cut here--
--cut here--

and the following testcase:

-- cut here--
float a[8] = { 1, 2, 3, 4, 5, 6, 7, 8};

extern float z[8];

void t (void)
{
  int i;

  for (i = 0; i < 8; i++)
    z[i] = z[i] + a[i];
}
--cut here--

When compiled with -m32 -mavx, we get:

        .align 32
        .type   a, @object
        .size   a, 32
a:

so, the alignment was already raised elsewhere. We get .align 16 for
-msse -m32 when vectorizing.

Without -msse (and consequently without vectorizing), we get for -m32:

        .align 4
        .type   a, @object
        .size   a, 32
a:

which corresponds to the 32-bit ABI rules (we still get .align 16 for the 64-bit ABI).

What bothers me in this testcase is the (unrelated) alignment of the
z[8] array. Even for 64-bit targets, we get:

#(insn:TI 6 5 8 2 (set (reg:V4SF 21 xmm0 [orig:90 vect__4.5 ] [90])
#        (unspec:V4SF [
#                (mem/c:V4SF (reg/f:DI 0 ax [89]) [2 MEM[(float
*)&z]+0 S16 A32])
#            ] UNSPEC_LOADU)) al.c:10 1195 {*sse_loadups}
#     (nil))
        movups  (%rax), %xmm0   # 6     *sse_loadups    [length = 3]

The ABI guarantees 16-byte alignment of z[8], but we fail to communicate
this to the compiler.

Uros.

Comments

Eric Botcazou Jan. 3, 2014, 11:20 a.m. UTC | #1
> When compiled with -m32 -mavx, we get:
> 
>         .align 32
>         .type   a, @object
>         .size   a, 32
> a:
> 
> so, the alignment was already raised elsewhere. We get .align 16 for
> -msse -m32 when vectorizing.
> 
> without -msse (and consequently without vectorizing), we get for -m32:
> 
>         .align 4
>         .type   a, @object
>         .size   a, 32
> a:
> 
> which corresponds to the 32-bit ABI rules (we still get .align 16 for the 64-bit ABI).

Yes, but the issue is that it was 32 before, so the alignment decrease is weird.
Uros Bizjak Jan. 3, 2014, 11:25 a.m. UTC | #2
On Fri, Jan 3, 2014 at 12:20 PM, Eric Botcazou <ebotcazou@adacore.com> wrote:
>> When compiled with -m32 -mavx, we get:
>>
>>         .align 32
>>         .type   a, @object
>>         .size   a, 32
>> a:
>>
>> so, the alignment was already raised elsewhere. We get .align 16 for
>> -msse -m32 when vectorizing.
>>
>> without -msse (and consequently without vectorizing), we get for -m32:
>>
>>         .align 4
>>         .type   a, @object
>>         .size   a, 32
>> a:
>>
>> which corresponds to the 32-bit ABI rules (we still get .align 16 for the 64-bit ABI).
>
> Yes, but the issue is that it was 32 before so the alignment decrease is weird.

Yes, but this change is benign, and as shown, I don't think we need
this functionality at all.  The data from other TUs is accessed with
4-byte alignment (on 32-bit targets, and unfortunately also on 64-bit
targets), and the alignment of local data is increased elsewhere.

I am testing a patch that removes "max_align" part from ix86_data_alignment.

Uros.
Jakub Jelinek Jan. 3, 2014, 11:59 a.m. UTC | #3
On Fri, Jan 03, 2014 at 12:25:00PM +0100, Uros Bizjak wrote:
> I am testing a patch that removes "max_align" part from ix86_data_alignment.

That looks like unnecessary pessimization.  Note the hunk in question is
guarded with opt, which means it is an optimization rather than an ABI issue:
it can increase alignment, but the compiler can only assume the increased
alignment if the symbol is not public or if it is public, but can't be
preempted by another TU's definition.  Even in that case it can be
worthwhile to increase the alignment, say if doing versioning for alignment,
or say just doing some AVX/AVX2/AVX512F version of memcpy/memset that can
handle it faster if it is sufficiently aligned by testing it at runtime.

So, IMHO we shouldn't drop it, just improve it.
Perhaps:

  int max_align = optimize_size ? BITS_PER_WORD
                                : MIN (512, MAX_OFILE_ALIGNMENT);

  if (opt
      && AGGREGATE_TYPE_P (type)
      && TYPE_SIZE (type)
      && TREE_CODE (TYPE_SIZE (type)) == INTEGER_CST
      && align < max_align)
    {
      int this_align;

      for (this_align = 256; this_align <= max_align; this_align *= 2)
	if (TREE_INT_CST_LOW (TYPE_SIZE (type)) < (unsigned) this_align
	    && !TREE_INT_CST_HIGH (TYPE_SIZE (type)))
	  break;
	else if (align < this_align)
	  align = this_align;
    }

which will handle both the 256 bit alignment for >= 256 bit objects,
512 bit alignment for >= 512 bit objects and will be prepared for the
future.

128-bit I think doesn't need to be handled; DATA_ALIGNMENT has been
using the 256-bit test already since its introduction in 1998:
http://gcc.gnu.org/ml/gcc-bugs/1998-03/msg00011.html

	Jakub
Uros Bizjak Jan. 3, 2014, 12:27 p.m. UTC | #4
On Fri, Jan 3, 2014 at 12:59 PM, Jakub Jelinek <jakub@redhat.com> wrote:
> On Fri, Jan 03, 2014 at 12:25:00PM +0100, Uros Bizjak wrote:
>> I am testing a patch that removes "max_align" part from ix86_data_alignment.
>
> That looks like unnecessary pessimization.  Note the hunk in question is
> guarded with opt, which means it is an optimization rather than ABI issue,
> it can increase alignment, but the compiler can only assume the increased
> alignment if the symbol is not public or if it is public, but can't be
> preempted by another TU's definition.  Even in that case it can be
> worthwhile to increase the alignment, say if doing versioning for alignment,
> or say just doing some AVX/AVX2/AVX512F version of memcpy/memset that can
> handle it faster if it is sufficiently aligned by testing it at runtime.
>
> So, IMHO we shouldn't drop it, just improve it.
> Perhaps:
>
>   int max_align = optimize_size ? BITS_PER_WORD
>                                 : MIN (512, MAX_OFILE_ALIGNMENT);
>
>   if (opt
>       && AGGREGATE_TYPE_P (type)
>       && TYPE_SIZE (type)
>       && TREE_CODE (TYPE_SIZE (type)) == INTEGER_CST
>       && align < max_align)
>     {
>       int this_align = 256;
>       for (this_align = 256; this_align <= max_align; this_align *= 2)
>         if (TREE_INT_CST_LOW (TYPE_SIZE (type)) < (unsigned) this_align
>             && !TREE_INT_CST_HIGH (TYPE_SIZE (type)))
>           break;
>         else if (align < this_align)
>           align = this_align;
>     }
>
> which will handle both the 256 bit alignment for >= 256 bit objects,
> 512 bit alignment for >= 512 bit objects and will be prepared for the
> future.
>
> 128-bit I think doesn't need to be handled; DATA_ALIGNMENT has been
> using the 256-bit test already since its introduction in 1998:
> http://gcc.gnu.org/ml/gcc-bugs/1998-03/msg00011.html

Thanks for the pointer; there is indeed a recommendation in the
optimization manual [1], section 3.6.4, which says:

--quote--
Misaligned data access can incur significant performance penalties.
This is particularly true for cache line splits. The size of a cache
line is 64 bytes in the Pentium 4 and other recent Intel processors,
including processors based on Intel Core microarchitecture.

An access to data unaligned on 64-byte boundary leads to two memory
accesses and requires several μops to be executed (instead of one).
Accesses that span 64-byte boundaries are likely to incur a large
performance penalty, the cost of each stall generally are greater on
machines with longer pipelines.

...

A 64-byte or greater data structure or array should be aligned so
that its base address is a multiple of 64. Sorting data in decreasing
size order is one heuristic for assisting with natural alignment. As
long as 16-byte boundaries (and cache lines) are never crossed,
natural alignment is not strictly necessary (though it is an easy way
to enforce this).
--/quote--

So, this part has nothing to do with AVX512, but with the cache line
width. And we do have a --param "l1-cache-line-size=64", detected with
-march=native, that could come in handy here.

This part should be rewritten (and commented) with the information
above in mind.

[1] http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-optimization-manual.html

Uros.

Patch

Index: i386.c
===================================================================
--- i386.c      (revision 206311)
+++ i386.c      (working copy)
@@ -26465,6 +26465,7 @@ 
 int
 ix86_data_alignment (tree type, int align, bool opt)
 {
+#if 0
   int max_align = optimize_size ? BITS_PER_WORD
                                : MIN (512, MAX_OFILE_ALIGNMENT);

@@ -26476,6 +26477,7 @@ 
          || TREE_INT_CST_HIGH (TYPE_SIZE (type)))
       && align < max_align)
     align = max_align;
+#endif

   /* x86-64 ABI requires arrays greater than 16 bytes to be aligned
      to 16byte boundary.  */