Message ID | CAFULd4bqaCZcJZmqZ9Cj=5vUJzofKJWg2hDxCm=-2g6yte66zQ@mail.gmail.com |
---|---|
State | New |
Headers | show |
> When compiled with -m32 -mavx, we get: > > .align 32 > .type a, @object > .size a, 32 > a: > > so, the alignment was already raised elsewhere. We get .align 16 for > -msse -m32 when vectorizing. > > without -msse (and consequently without vectorizing), we get for -m32: > > .align 4 > .type a, @object > .size a, 32 > a: > > which corresponds to 32bit ABI rules (we still get .align16 for 64bit ABI). Yes, but the issue is that it was 32 before so the alignment decrease is weird.
On Fri, Jan 3, 2014 at 12:20 PM, Eric Botcazou <ebotcazou@adacore.com> wrote: >> When compiled with -m32 -mavx, we get: >> >> .align 32 >> .type a, @object >> .size a, 32 >> a: >> >> so, the alignment was already raised elsewhere. We get .align 16 for >> -msse -m32 when vectorizing. >> >> without -msse (and consequently without vectorizing), we get for -m32: >> >> .align 4 >> .type a, @object >> .size a, 32 >> a: >> >> which corresponds to 32bit ABI rules (we still get .align16 for 64bit ABI). > > Yes, but the issue is that it was 32 before so the alignment decrease is weird. Yes, but this change is benign, and as shown, I don't think we need this functionality at all. The data from other TUs is accessed with a 4 byte alignment (on 32bit targets, and unfortunately also on 64bit targets), and the alignment of local data is increased elsewhere. I am testing a patch that removes "max_align" part from ix86_data_alignment. Uros.
On Fri, Jan 03, 2014 at 12:25:00PM +0100, Uros Bizjak wrote:
> I am testing a patch that removes "max_align" part from ix86_data_alignment.
That looks like unnecessary pessimization. Note the hunk in question is
guarded with opt, which means it is an optimization rather than ABI issue,
it can increase alignment, but the compiler can only assume the increased
alignment if the symbol is not public or if it is public, but can't be
preempted by another TU's definition. Even in that case it can be
worthwhile to increase the alignment, say if doing versioning for alignment,
or say just doing some AVX/AVX2/AVX512F version of memcpy/memset that can
handle it faster if it is sufficiently aligned by testing it at runtime.
So, IMHO we shouldn't drop it, just improve it.
Perhaps:
int max_align = optimize_size ? BITS_PER_WORD
: MIN (512, MAX_OFILE_ALIGNMENT);
if (opt
&& AGGREGATE_TYPE_P (type)
&& TYPE_SIZE (type)
&& TREE_CODE (TYPE_SIZE (type)) == INTEGER_CST
&& align < max_align)
{
int this_align = 256;
for (this_align = 256; this_align <= max_align; this_align *= 2)
if (TREE_INT_CST_LOW (TYPE_SIZE (type)) < (unsigned) this_align
&& !TREE_INT_CST_HIGH (TYPE_SIZE (type)))
break;
else if (align < this_align)
align = this_align;
}
which will handle both the 256 bit alignment for >= 256 bit objects,
512 bit alignment for >= 512 bit objects and will be prepared for the
future.
128 bit I think doesn't need to be handled, DATA_ALIGNMENT has been
using 256-bit test already since it's introduction in 1998:
http://gcc.gnu.org/ml/gcc-bugs/1998-03/msg00011.html
Jakub
On Fri, Jan 3, 2014 at 12:59 PM, Jakub Jelinek <jakub@redhat.com> wrote: > On Fri, Jan 03, 2014 at 12:25:00PM +0100, Uros Bizjak wrote: >> I am testing a patch that removes "max_align" part from ix86_data_alignment. > > That looks like unnecessary pessimization. Note the hunk in question is > guarded with opt, which means it is an optimization rather than ABI issue, > it can increase alignment, but the compiler can only assume the increased > alignment if the symbol is not public or if it is public, but can't be > preempted by another TU's definition. Even in that case it can be > worthwhile to increase the alignment, say if doing versioning for alignment, > or say just doing some AVX/AVX2/AVX512F version of memcpy/memset that can > handle it faster if it is sufficiently aligned by testing it at runtime. > > So, IMHO we shouldn't drop it, just improve it. > Perhaps: > > int max_align = optimize_size ? BITS_PER_WORD > : MIN (512, MAX_OFILE_ALIGNMENT); > > if (opt > && AGGREGATE_TYPE_P (type) > && TYPE_SIZE (type) > && TREE_CODE (TYPE_SIZE (type)) == INTEGER_CST > && align < max_align) > { > int this_align = 256; > for (this_align = 256; this_align <= max_align; this_align *= 2) > if (TREE_INT_CST_LOW (TYPE_SIZE (type)) < (unsigned) this_align > && !TREE_INT_CST_HIGH (TYPE_SIZE (type))) > break; > else if (align < this_align) > align = this_align; > } > > which will handle both the 256 bit alignment for >= 256 bit objects, > 512 bit alignment for >= 512 bit objects and will be prepared for the > future. > > 128 bit I think doesn't need to be handled, DATA_ALIGNMENT has been > using 256-bit test already since it's introduction in 1998: > http://gcc.gnu.org/ml/gcc-bugs/1998-03/msg00011.html Thanks for the pointer, there is indeed the recommendation in optimization manual [1], section 3.6.4, where it is said: --quote-- Misaligned data access can incur significant performance penalties. This is particularly true for cache line splits. The size of a cache line is 64 bytes in the Pentium 4 and other recent Intel processors, including processors based on Intel Core microarchitecture. An access to data unaligned on 64-byte boundary leads to two memory accesses and requires several μops to be executed (instead of one). Accesses that span 64-byte boundaries are likely to incur a large performance penalty, the cost of each stall generally are greater on machines with longer pipelines. ... A 64-byte or greater data structure or array should be aligned so that its base address is a multiple of 64. Sorting data in decreasing size order is one heuristic for assisting with natural alignment. As long as 16- byte boundaries (and cache lines) are never crossed, natural alignment is not strictly necessary (though it is an easy way to enforce this). --/quote-- So, this part has nothing to do with AVX512, but with cache line width. And we do have a --param "l1-cache-line-size=64", detected with -march=native that could come handy here. This part should be rewritten (and commented) with the information above in mind. [1] http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-optimization-manual.html Uros.
Index: i386.c =================================================================== --- i386.c (revision 206311) +++ i386.c (working copy) @@ -26465,6 +26465,7 @@ int ix86_data_alignment (tree type, int align, bool opt) { +#if 0 int max_align = optimize_size ? BITS_PER_WORD : MIN (512, MAX_OFILE_ALIGNMENT); @@ -26476,6 +26477,7 @@ || TREE_INT_CST_HIGH (TYPE_SIZE (type))) && align < max_align) align = max_align; +#endif /* x86-64 ABI requires arrays greater than 16 bytes to be aligned to 16byte boundary. */