
[i386,tuning] Generate 128-bit AVX by default for bdver1

Message ID D4C76825A6780047854A11E93CDE84D004D3054E8E@SAUSEXMBP01.amd.com
State New

Commit Message

Fang, Changpeng Feb. 11, 2011, 12:20 a.m. UTC
Hi, 

Attached is the patch to force GCC to generate 128-bit AVX instructions for bdver1. We found that on
the current Bulldozer processors, AVX128 performs better than AVX256. For example, AVX128 is 3%
faster than AVX256 on CFP2006, and 2-3% faster than AVX256 on Polyhedron.

As a result, we would prefer GCC 4.6 to generate only 128-bit AVX instructions for bdver1.

The patch passed bootstrap on x86_64-unknown-linux-gnu with "-O3 -g -march=bdver1", as well as
the necessary correctness and performance testing.

Is it OK to commit to trunk?

Thanks,

Changpeng
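
For reference, here is the kind of loop this tuning choice affects (a
hypothetical test case, not part of the patch; file and function names
are made up):

/* avx-test.c: a simple vectorizable loop.  The vectorizer's choice of
   V4SFmode vs. V8SFmode decides whether it is compiled with 128-bit
   (%xmm) or 256-bit (%ymm) AVX instructions.  */

#define N 1024

float a[N], b[N], c[N];

void
add_arrays (void)
{
  int i;
  for (i = 0; i < N; i++)
    c[i] = a[i] + b[i];
}

With the patch applied, "gcc -O3 -march=bdver1 -S avx-test.c" should
vectorize this loop with V4SFmode and emit vaddps on %xmm registers;
without it, the vectorizer picks V8SFmode and uses %ymm registers.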

Comments

Richard Biener Feb. 11, 2011, 9:46 a.m. UTC | #1
On Thu, 10 Feb 2011, Fang, Changpeng wrote:

> Hi, 
> 
> Attached is the patch to force GCC to generate 128-bit AVX instructions for bdver1. We found that on
> the current Bulldozer processors, AVX128 performs better than AVX256. For example, AVX128 is 3%
> faster than AVX256 on CFP2006, and 2-3% faster than AVX256 on Polyhedron.
> 
> As a result, we would prefer GCC 4.6 to generate only 128-bit AVX instructions for bdver1.
> 
> The patch passed bootstrap on x86_64-unknown-linux-gnu with "-O3 -g -march=bdver1", as well as
> the necessary correctness and performance testing.
> 
> Is it OK to commit to trunk?

I think there was no attempt to tune anything for AVX256; in particular,
the vectorizer cost model may be completely off.  HJ and Andi also
hinted at some alignment problems (at least Sandy Bridge seems to have a
large penalty when loads cross a cache-line boundary).  So, did you do
any investigation into why 256-bit vectors are slower for you?  Are
these cases that the cost model could easily catch?

Thanks,
Richard.
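
For context on the alignment point above, a hypothetical illustration
(not from GCC) of when a 256-bit load crosses a cache line:

/* With 64-byte cache lines, an unaligned 32-byte (256-bit) load
   straddles a line boundary whenever addr % 64 > 32, while a 16-byte
   (128-bit) load does so only when addr % 64 > 48; so 256-bit vectors
   hit the crossing penalty roughly twice as often on misaligned
   data.  */

#include <immintrin.h>

float buf[64] __attribute__ ((aligned (64)));

__m256
load_straddles_line (void)
{
  /* buf + 9 is 36 bytes past a 64-byte boundary, so this 32-byte
     load touches bytes 36..67 and crosses into the next line.  */
  return _mm256_loadu_ps (buf + 9);
}
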
Andrew Pinski Feb. 11, 2011, 6:42 p.m. UTC | #2
On Fri, Feb 11, 2011 at 1:46 AM, Richard Guenther <rguenther@suse.de> wrote:
>> Attached is the patch to force GCC to generate 128-bit AVX instructions for bdver1. We found that on
>> the current Bulldozer processors, AVX128 performs better than AVX256. For example, AVX128 is 3%
>> faster than AVX256 on CFP2006, and 2-3% faster than AVX256 on Polyhedron.
>>
>> As a result, we would prefer GCC 4.6 to generate only 128-bit AVX instructions for bdver1.
>>
>> The patch passed bootstrap on x86_64-unknown-linux-gnu with "-O3 -g -march=bdver1", as well as
>> the necessary correctness and performance testing.
>>
>> Is it OK to commit to trunk?
>
> I think there was no attempt to tune anything for AVX256; in particular,
> the vectorizer cost model may be completely off.  HJ and Andi also
> hinted at some alignment problems (at least Sandy Bridge seems to have a
> large penalty when loads cross a cache-line boundary).  So, did you do
> any investigation into why 256-bit vectors are slower for you?  Are
> these cases that the cost model could easily catch?


IIRC from reading about bdver1, AVX256 is emulated by splitting each
instruction into two AVX128 operations, which will obviously be slower
in some cases.

-- Pinski
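
To make the splitting concrete, here is a rough conceptual sketch in C
intrinsics (illustrative only; this is not how the hardware literally
decomposes the operation) of a 256-bit add performed as two 128-bit
halves:

#include <immintrin.h>

/* Conceptual sketch: on a 128-bit datapath, a 256-bit add behaves
   roughly like the two 128-bit adds below, plus the overhead of
   splitting and recombining the halves.  */

__m256
add256_as_two_128 (__m256 x, __m256 y)
{
  __m128 xlo = _mm256_castps256_ps128 (x);   /* low 128 bits */
  __m128 xhi = _mm256_extractf128_ps (x, 1); /* high 128 bits */
  __m128 ylo = _mm256_castps256_ps128 (y);
  __m128 yhi = _mm256_extractf128_ps (y, 1);

  __m128 lo = _mm_add_ps (xlo, ylo);
  __m128 hi = _mm_add_ps (xhi, yhi);

  return _mm256_insertf128_ps (_mm256_castps128_ps256 (lo), hi, 1);
}
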
Fang, Changpeng Feb. 14, 2011, 6:02 p.m. UTC | #3
On Fri, Feb 11, 2011 at 1:46 AM, Richard Guenther <rguenther@suse.de> wrote:
>> Attached is the patch to force GCC to generate 128-bit AVX instructions for bdver1. We found that on
>> the current Bulldozer processors, AVX128 performs better than AVX256. For example, AVX128 is 3%
>> faster than AVX256 on CFP2006, and 2-3% faster than AVX256 on Polyhedron.
>>
>> As a result, we would prefer GCC 4.6 to generate only 128-bit AVX instructions for bdver1.
>>
>> The patch passed bootstrap on x86_64-unknown-linux-gnu with "-O3 -g -march=bdver1", as well as
>> the necessary correctness and performance testing.
>>
>> Is it OK to commit to trunk?
>
> I think there was no attempt to tune anything for AVX256; in particular,
> the vectorizer cost model may be completely off.  HJ and Andi also
> hinted at some alignment problems (at least Sandy Bridge seems to have a
> large penalty when loads cross a cache-line boundary).  So, did you do
> any investigation into why 256-bit vectors are slower for you?  Are
> these cases that the cost model could easily catch?


>IIRC from reading about bdver1, AVX256 is emulated by splitting each
>instruction into two AVX128 operations, which will obviously be slower
>in some cases.

Yes, this should be the major reason that AVX256 is slower. Also, HJ's patch that splits unaligned 256-bit
loads/stores does not help.

We plan for GCC 4.6 to generate 128-bit AVX for bdver1. It is true that we should tune the vectorizer for AVX256
and AVX128, but I am afraid that should be done in the 4.7 time frame.

Thanks,

Changpeng

Patch

From b2587889e4c8016f8bc4dde53fa0d59c1a9074da Mon Sep 17 00:00:00 2001
From: Changpeng Fang <chfang@houghton.(none)>
Date: Thu, 10 Feb 2011 16:11:55 -0800
Subject: [PATCH] Generate 128-bit AVX instructions by default for bdver1

	* config/i386/i386.h (enum ix86_tune_indices): Introduce
	X86_PREFER_AVX128 feature entry.
	(ix86_tune_features): Define TARGET_PREFER_AVX128.

	* config/i386/i386.c (initial_ix86_tune_features): Set
	X86_PREFER_AVX128 for bdver1.
	(ix86_preferred_simd_mode): Set the appropriate modes when
	X86_PREFER_AVX128 is set (for bdver1).
---
 gcc/config/i386/i386.c |    7 +++++--
 gcc/config/i386/i386.h |    3 +++
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 12c7062..5c8346e 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -2082,6 +2082,9 @@  static unsigned int initial_ix86_tune_features[X86_TUNE_LAST] = {
   /* X86_TUNE_VECTORIZE_DOUBLE: Enable double precision vector
      instructions.  */
   ~m_ATOM,
+
+  /* X86_PREFER_AVX128: Generate AVX 128 instead of AVX 256.  */
+  m_BDVER1,
 };
 
 /* Feature tests against the various architecture variations.  */
@@ -34698,9 +34701,9 @@  ix86_preferred_simd_mode (enum machine_mode mode)
   switch (mode)
     {
     case SFmode:
-      return TARGET_AVX ? V8SFmode : V4SFmode;
+      return TARGET_AVX ? (TARGET_PREFER_AVX128 ? V4SFmode : V8SFmode) : V4SFmode;
     case DFmode:
-      return TARGET_AVX ? V4DFmode : V2DFmode;
+      return TARGET_AVX ? (TARGET_PREFER_AVX128 ? V2DFmode : V4DFmode) : V2DFmode;
     case DImode:
       return V2DImode;
     case SImode:
diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
index f14a95d..b84e6ed 100644
--- a/gcc/config/i386/i386.h
+++ b/gcc/config/i386/i386.h
@@ -322,6 +322,7 @@  enum ix86_tune_indices {
   X86_TUNE_FUSE_CMP_AND_BRANCH,
   X86_TUNE_OPT_AGU,
   X86_TUNE_VECTORIZE_DOUBLE,
+  X86_PREFER_AVX128,
 
   X86_TUNE_LAST
 };
@@ -418,6 +419,8 @@  extern unsigned char ix86_tune_features[X86_TUNE_LAST];
 #define TARGET_OPT_AGU ix86_tune_features[X86_TUNE_OPT_AGU]
 #define TARGET_VECTORIZE_DOUBLE \
 	ix86_tune_features[X86_TUNE_VECTORIZE_DOUBLE]
+#define TARGET_PREFER_AVX128 \
+	ix86_tune_features[X86_PREFER_AVX128]
 
 /* Feature tests against the various architecture variations.  */
 enum ix86_arch_indices {
-- 
1.6.3.3
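
For readers unfamiliar with the i386 tuning tables: each entry of
initial_ix86_tune_features is a bitmask of processors, and at option
processing time GCC copies the bit for the selected -mtune CPU out of
each mask into ix86_tune_features[], which is what macros such as
TARGET_PREFER_AVX128 test.  A simplified, self-contained sketch of that
scheme (illustrative only, not the actual GCC source):

/* Simplified sketch of the i386 tuning-mask mechanism.  */
#include <stdio.h>

enum cpu { CPU_GENERIC, CPU_CORE2, CPU_BDVER1, CPU_LAST };

#define m_CORE2  (1u << CPU_CORE2)
#define m_BDVER1 (1u << CPU_BDVER1)

enum tune_index { TUNE_PREFER_AVX128, TUNE_LAST };

/* Each feature entry is a mask of the CPUs it applies to.  */
static const unsigned initial_tune_features[TUNE_LAST] = {
  /* TUNE_PREFER_AVX128: only bdver1 prefers 128-bit AVX.  */
  m_BDVER1,
};

static unsigned char tune_features[TUNE_LAST];

static void
set_tune_features (enum cpu tune_cpu)
{
  int i;
  for (i = 0; i < TUNE_LAST; i++)
    tune_features[i] = (initial_tune_features[i] >> tune_cpu) & 1;
}

int
main (void)
{
  set_tune_features (CPU_BDVER1);
  printf ("prefer avx128: %d\n", tune_features[TUNE_PREFER_AVX128]);
  return 0;
}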