
Disable FMADD in chains for Zen4 and generic

Message ID ZXhwQVQzBiy2hv89@kam.mff.cuni.cz
State New
Series Disable FMADD in chains for Zen4 and generic

Commit Message

Jan Hubicka Dec. 12, 2023, 2:37 p.m. UTC
Hi,
this patch disables use of FMA in the matrix multiplication loop for generic (for
x86-64-v3) and zen4.  I tested this on zen4 and Xeon Gold 6212U.

For Intel this is neutral both on the matrix multiplication microbenchmark
(attached) and spec2k17 where the difference was within noise for Core.

On core the micro-benchmark runs as follows:

With FMA:

       578,500,241      cycles:u                         #    3.645 GHz                         ( +-  0.12% )
       753,318,477      instructions:u                   #    1.30  insn per cycle              ( +-  0.00% )
       125,417,701      branches:u                       #  790.227 M/sec                       ( +-  0.00% )
          0.159146 +- 0.000363 seconds time elapsed  ( +-  0.23% )


No FMA:

       577,573,960      cycles:u                         #    3.514 GHz                         ( +-  0.15% )
       878,318,479      instructions:u                   #    1.52  insn per cycle              ( +-  0.00% )
       125,417,702      branches:u                       #  763.035 M/sec                       ( +-  0.00% )
          0.164734 +- 0.000321 seconds time elapsed  ( +-  0.19% )

So the cycle count is unchanged and discrete multiply+add takes the same time as FMA.

While on zen:


With FMA:
         484875179      cycles:u                         #    3.599 GHz                      ( +-  0.05% )  (82.11%)
         752031517      instructions:u                   #    1.55  insn per cycle         
         125106525      branches:u                       #  928.712 M/sec                    ( +-  0.03% )  (85.09%)
            128356      branch-misses:u                  #    0.10% of all branches          ( +-  0.06% )  (83.58%)

No FMA:
         375875209      cycles:u                         #    3.592 GHz                      ( +-  0.08% )  (80.74%)
         875725341      instructions:u                   #    2.33  insn per cycle
         124903825      branches:u                       #    1.194 G/sec                    ( +-  0.04% )  (84.59%)
          0.105203 +- 0.000188 seconds time elapsed  ( +-  0.18% )

The difference is that Core CPUs understand the fact that fmadd does not need
all three parameters to start computation, while Zen cores don't.
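
To make the two inner-loop shapes concrete, here is a minimal sketch using AVX
intrinsics (an illustration only, not the exact code GCC emits; build with
something like -mfma):

#include <immintrin.h>

/* FMA variant: the accumulator update is one fused op, so on Zen4 each
   step of the c[i][j] chain pays the 4-cycle FMA latency.  */
static inline __m256 step_fma (__m256 acc, __m256 a, __m256 b)
{
  return _mm256_fmadd_ps (a, b, acc);   /* acc = a*b + acc */
}

/* Split variant: the multiply does not depend on the accumulator and can
   run ahead; only the 3-cycle add sits on the critical chain.  */
static inline __m256 step_mul_add (__m256 acc, __m256 a, __m256 b)
{
  __m256 t = _mm256_mul_ps (a, b);      /* off the critical path */
  return _mm256_add_ps (acc, t);        /* dependent add, latency 3 */
}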

Since this seems a noticeable win on Zen and not a loss on Core, it seems like a
good default for generic.

I plan to commit the patch next week if there are no complaints.

Honza

#include <stdio.h>
#include <time.h>

#define SIZE 1000

float a[SIZE][SIZE];
float b[SIZE][SIZE];
float c[SIZE][SIZE];

void init(void)
{
   int i, j, k;
   for(i=0; i<SIZE; ++i)
   {
      for(j=0; j<SIZE; ++j)
      {
         a[i][j] = (float)i + j;
         b[i][j] = (float)i - j;
         c[i][j] = 0.0f;
      }
   }
}

void mult(void)
{
   int i, j, k;

   for(i=0; i<SIZE; ++i)
   {
      for(j=0; j<SIZE; ++j)
      {  
         for(k=0; k<SIZE; ++k)
         {  
            c[i][j] += a[i][k] * b[k][j];
         }  
      }
   }
}

int main(void)
{
   clock_t s, e;

   init();
   s=clock();
   mult();
   e=clock();
   printf("        mult took %10d clocks\n", (int)(e-s));

   return 0;

}

	* config/i386/x86-tune.def (X86_TUNE_AVOID_128FMA_CHAINS, X86_TUNE_AVOID_256FMA_CHAINS):
	Enable for znver4 and generic.

Comments

Richard Biener Dec. 12, 2023, 3:01 p.m. UTC | #1
On Tue, Dec 12, 2023 at 3:38 PM Jan Hubicka <hubicka@ucw.cz> wrote:
>
> Hi,
> this patch disables use of FMA in the matrix multiplication loop for generic (for
> x86-64-v3) and zen4.  I tested this on zen4 and Xeon Gold 6212U.
>
> For Intel this is neutral both on the matrix multiplication microbenchmark
> (attached) and spec2k17 where the difference was within noise for Core.
>
> On core the micro-benchmark runs as follows:
>
> With FMA:
>
>        578,500,241      cycles:u                         #    3.645 GHz                         ( +-  0.12% )
>        753,318,477      instructions:u                   #    1.30  insn per cycle              ( +-  0.00% )
>        125,417,701      branches:u                       #  790.227 M/sec                       ( +-  0.00% )
>           0.159146 +- 0.000363 seconds time elapsed  ( +-  0.23% )
>
>
> No FMA:
>
>        577,573,960      cycles:u                         #    3.514 GHz                         ( +-  0.15% )
>        878,318,479      instructions:u                   #    1.52  insn per cycle              ( +-  0.00% )
>        125,417,702      branches:u                       #  763.035 M/sec                       ( +-  0.00% )
>           0.164734 +- 0.000321 seconds time elapsed  ( +-  0.19% )
>
> So the cycle count is unchanged and discrete multiply+add takes the same time as FMA.
>
> While on zen:
>
>
> With FMA:
>          484875179      cycles:u                         #    3.599 GHz                      ( +-  0.05% )  (82.11%)
>          752031517      instructions:u                   #    1.55  insn per cycle
>          125106525      branches:u                       #  928.712 M/sec                    ( +-  0.03% )  (85.09%)
>             128356      branch-misses:u                  #    0.10% of all branches          ( +-  0.06% )  (83.58%)
>
> No FMA:
>          375875209      cycles:u                         #    3.592 GHz                      ( +-  0.08% )  (80.74%)
>          875725341      instructions:u                   #    2.33  insn per cycle
>          124903825      branches:u                       #    1.194 G/sec                    ( +-  0.04% )  (84.59%)
>           0.105203 +- 0.000188 seconds time elapsed  ( +-  0.18% )
>
> The difference is that Core CPUs understand the fact that fmadd does not need
> all three parameters to start computation, while Zen cores don't.

This came up in a separate thread as well, but when doing reassoc of a
chain with multiple dependent FMAs.

I can't understand how this uarch detail can affect performance when,
as in the testcase, the longest input latency is on the multiplication
from a memory load.  Do we actually understand _why_ the FMAs are
slower here?

Do we know that Cores can start the multiplication part when the add
operand isn't ready yet?  I'm curious how you set up a micro benchmark
to measure this.

There's one detail on Zen in that it can issue 2 FADDs and 2 FMUL/FMA
per cycle.  So in theory we can at most do 2 FMA per cycle, but with
latency (FMA) == 4 for Zen3/4 and latency (FADD/FMUL) == 3 we might be
able to squeeze out a little bit more throughput when there are many
FADD/FMUL ops to execute?  That works independently of whether FMAs
have a head-start on multiplication, as you'd still be bottle-necked
on the 2-wide issue for FMA?

On Icelake it seems all FADD/FMUL/FMA share ports 0 and 1 and all have
a latency of four.  So you should get worse results there (looking at
the numbers above you do get worse results, slightly so); probably the
higher number of uops is hidden by the latency.

> Since this seems a noticeable win on Zen and not a loss on Core, it seems like a
> good default for generic.
>
> I plan to commit the patch next week if there are no complaints.

complaint!

Richard.

> Honza
>
> #include <stdio.h>
> #include <time.h>
>
> #define SIZE 1000
>
> float a[SIZE][SIZE];
> float b[SIZE][SIZE];
> float c[SIZE][SIZE];
>
> void init(void)
> {
>    int i, j, k;
>    for(i=0; i<SIZE; ++i)
>    {
>       for(j=0; j<SIZE; ++j)
>       {
>          a[i][j] = (float)i + j;
>          b[i][j] = (float)i - j;
>          c[i][j] = 0.0f;
>       }
>    }
> }
>
> void mult(void)
> {
>    int i, j, k;
>
>    for(i=0; i<SIZE; ++i)
>    {
>       for(j=0; j<SIZE; ++j)
>       {
>          for(k=0; k<SIZE; ++k)
>          {
>             c[i][j] += a[i][k] * b[k][j];
>          }
>       }
>    }
> }
>
> int main(void)
> {
>    clock_t s, e;
>
>    init();
>    s=clock();
>    mult();
>    e=clock();
>    printf("        mult took %10d clocks\n", (int)(e-s));
>
>    return 0;
>
> }
>
>         * config/i386/x86-tune.def (X86_TUNE_AVOID_128FMA_CHAINS, X86_TUNE_AVOID_256FMA_CHAINS):
>         Enable for znver4 and generic.
>
> diff --git a/gcc/config/i386/x86-tune.def b/gcc/config/i386/x86-tune.def
> index 43fa9e8fd6d..74b03cbcc60 100644
> --- a/gcc/config/i386/x86-tune.def
> +++ b/gcc/config/i386/x86-tune.def
> @@ -515,13 +515,13 @@ DEF_TUNE (X86_TUNE_USE_SCATTER_8PARTS, "use_scatter_8parts",
>
>  /* X86_TUNE_AVOID_128FMA_CHAINS: Avoid creating loops with tight 128bit or
>     smaller FMA chain.  */
> -DEF_TUNE (X86_TUNE_AVOID_128FMA_CHAINS, "avoid_fma_chains", m_ZNVER1 | m_ZNVER2 | m_ZNVER3
> -          | m_YONGFENG)
> +DEF_TUNE (X86_TUNE_AVOID_128FMA_CHAINS, "avoid_fma_chains", m_ZNVER1 | m_ZNVER2 | m_ZNVER3 | m_ZNVER4
> +          | m_YONGFENG | m_GENERIC)
>
>  /* X86_TUNE_AVOID_256FMA_CHAINS: Avoid creating loops with tight 256bit or
>     smaller FMA chain.  */
> -DEF_TUNE (X86_TUNE_AVOID_256FMA_CHAINS, "avoid_fma256_chains", m_ZNVER2 | m_ZNVER3
> -         | m_CORE_HYBRID | m_SAPPHIRERAPIDS | m_CORE_ATOM)
> +DEF_TUNE (X86_TUNE_AVOID_256FMA_CHAINS, "avoid_fma256_chains", m_ZNVER2 | m_ZNVER3 | m_ZNVER4
> +         | m_CORE_HYBRID | m_SAPPHIRERAPIDS | m_CORE_ATOM | m_GENERIC)
>
>  /* X86_TUNE_AVOID_512FMA_CHAINS: Avoid creating loops with tight 512bit or
>     smaller FMA chain.  */
Jan Hubicka Dec. 12, 2023, 4:48 p.m. UTC | #2
> 
> This came up in a separate thread as well, but when doing reassoc of a
> chain with
> multiple dependent FMAs.
> 
> I can't understand how this uarch detail can affect performance when
> as in the testcase
> the longest input latency is on the multiplication from a memory load.
> Do we actually
> understand _why_ the FMAs are slower here?

This is my understanding:
The loop is well predictable and the memory address calculations + loads can
happen in parallel.  So the main dependency chain is updating the accumulator
computing c[i][j].  FMADD is 4 cycles on Zen4, while ADD is 3.  So the
loop with FMADD cannot run any faster than one iteration per 4 cycles,
while with ADD it can do one iteration per 3.  This roughly matches the
speedup we see: 484875179*3/4 = 363656384, while the measured count is
375875209 cycles.  The benchmark is quite short and I run it 100 times in
perf to collect the data, so the overhead probably accounts for the smaller
than expected difference.
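
For reference, the statistics above can be gathered with something along these
lines (the file name and -march flag are just placeholders; perf's -r option
repeats the run and reports the +- deviations):

$ gcc -O3 -march=x86-64-v3 matrix.c -o matrix
$ perf stat -r 100 ./matrix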

> 
> Do we know that Cores can start the multiplication part when the add
> operand isn't
> ready yet?  I'm curious how you set up a micro benchmark to measure this.

Here is a cycle-counting benchmark:
#include <stdio.h>
int
main()
{ 
  float o=0;
  for (int i = 0; i < 1000000000; i++)
  {
#ifdef ACCUMULATE
    /* Dependency chain goes through the addend (accumulator), as in the
       matrix multiplication loop.  */
    float p1 = o;
    float p2 = 0;
#else
    /* Dependency chain goes through one of the multiplicands.  */
    float p1 = 0;
    float p2 = o;
#endif
    float p3 = 0;
#ifdef FMA
    asm ("vfmadd231ss %2, %3, %0":"=x"(o):"0"(p1),"x"(p2),"x"(p3));
#else
    float t;
    asm ("mulss %2, %0":"=x"(t):"0"(p2),"x"(p3));
    asm ("addss %2, %0":"=x"(o):"0"(p1),"x"(t));
#endif
  }
  printf ("%f\n",o);
  return 0;
}

It performs FMAs in sequence all with zeros.  If you define ACCUMULATE
you get the pattern from matrix multiplication. On Zen I get:

jh@ryzen3:~> gcc -O3 -DFMA -DACCUMULATE l.c ; perf stat ./a.out 2>&1 | grep cycles:
     4,001,011,489      cycles:u                         #    4.837 GHz                         (83.32%)
jh@ryzen3:~> gcc -O3 -DACCUMULATE l.c ; perf stat ./a.out 2>&1 | grep cycles:
     3,000,335,064      cycles:u                         #    4.835 GHz                         (83.08%)

So it is 4 cycles for the FMA loop and 3 cycles for separate mul and add.
Muls execute in parallel with adds in the second case.
If the dependency chain is instead through the multiplied parameter I get:

jh@ryzen3:~> gcc -O3 -DFMA l.c ; perf stat ./a.out 2>&1 | grep cycles:
     4,000,118,069      cycles:u                         #    4.836 GHz                         (83.32%)
jh@ryzen3:~> gcc -O3  l.c ; perf stat ./a.out 2>&1 | grep cycles:
     6,001,947,341      cycles:u                         #    4.838 GHz                         (83.32%)

FMA is the same (it is still one FMA instruction per iteration) while
mul+add is 6 cycles since the dependency chain is longer.

Core gives me:

jh@aster:~> gcc -O3 l.c -DFMA -DACCUMULATE ; perf stat ./a.out 2>&1 | grep cycles:u
     5,001,515,473      cycles:u                         #    3.796 GHz
jh@aster:~> gcc -O3 l.c  -DACCUMULATE ; perf stat ./a.out 2>&1 | grep cycles:u
     4,000,977,739      cycles:u                         #    3.819 GHz
jh@aster:~> gcc -O3 l.c  -DFMA ; perf stat ./a.out 2>&1 | grep cycles:u
     5,350,523,047      cycles:u                         #    3.814 GHz
jh@aster:~> gcc -O3 l.c   ; perf stat ./a.out 2>&1 | grep cycles:u
    10,251,994,240      cycles:u                         #    3.852 GHz

So FMA seems to be 5 cycles if we accumulate and a bit more (above noise) if we
do the long chain.  I think some cores have a bigger difference between these
two numbers.
I am a bit surprised by the last number of 10 cycles.  I would expect 8.

I also changed the matrix multiplication benchmark to repeat the multiplication
100 times; the numbers are below.

> 
> There's one detail on Zen in that it can issue 2 FADDs and 2 FMUL/FMA per cycle.
> So in theory we can at most do 2 FMA per cycle but with latency (FMA)
> == 4 for Zen3/4
> and latency (FADD/FMUL) == 3 we might be able to squeeze out a little bit more
> throughput when there are many FADD/FMUL ops to execute?  That works independent
> on whether FMAs have a head-start on multiplication as you'd still be
> bottle-necked
> on the 2-wide issue for FMA?

I am not sure I follow what you say here.  The knob only checks for
FMADDs used in accumulation-type loops, so it is latency 4 versus latency 3
per accumulation.  Indeed, in other loops fmadd is a win.
> 
> On Icelake it seems all FADD/FMUL/FMA share ports 0 and 1 and all have a latency
> of four.  So you should get worse results there (looking at the
> numbers above you
> do get worse results, slightly so), probably the higher number of uops is hidden
> by the latency.
I think the slower non-FMA on Core was just noise (it shows in the overall
time but not in the cycle counts).

With the multiplication repeated 100 times, on Intel I get:

jh@aster:~/gcc/build/gcc> gcc matrix-nofma.s ; perf stat ./a.out
        mult took   15146405 clocks

 Performance counter stats for './a.out':

         15,149.62 msec task-clock:u                     #    1.000 CPUs utilized             
                 0      context-switches:u               #    0.000 /sec                      
                 0      cpu-migrations:u                 #    0.000 /sec                      
               948      page-faults:u                    #   62.576 /sec                      
    55,803,919,561      cycles:u                         #    3.684 GHz                       
    87,615,590,411      instructions:u                   #    1.57  insn per cycle            
    12,512,896,307      branches:u                       #  825.955 M/sec                     
        12,605,403      branch-misses:u                  #    0.10% of all branches           

      15.150064855 seconds time elapsed

      15.146817000 seconds user
       0.003333000 seconds sys


jh@aster:~/gcc/build/gcc> gcc matrix-fma.s ; perf stat ./a.out
        mult took   15308879 clocks

 Performance counter stats for './a.out':

         15,312.27 msec task-clock:u                     #    1.000 CPUs utilized             
                 1      context-switches:u               #    0.000 /sec                      
                 0      cpu-migrations:u                 #    0.000 /sec                      
               948      page-faults:u                    #   61.911 /sec                      
    59,449,535,152      cycles:u                         #    3.882 GHz                       
    75,115,590,460      instructions:u                   #    1.26  insn per cycle            
    12,512,896,356      branches:u                       #  817.181 M/sec                     
        12,605,235      branch-misses:u                  #    0.10% of all branches           

      15.312776274 seconds time elapsed

      15.309462000 seconds user
       0.003333000 seconds sys

The difference seems close to noise.
If I am counting right, with 100*1000*1000*1000 multiplications and a 5-cycle
chain I would expect 5*100*1000*1000*1000/8 = 62500000000 cycles overall.
Perhaps it runs a bit faster because the chain is independent for every 125
multiplications (each c[i][j] takes 1000/8 = 125 vector FMAs).

jh@alberti:~> gcc matrix-nofma.s ; perf stat ./a.out
        mult took   10046353 clocks

 Performance counter stats for './a.out':

          10051.47 msec task-clock:u                     #    0.999 CPUs utilized          
                 0      context-switches:u               #    0.000 /sec                   
                 0      cpu-migrations:u                 #    0.000 /sec                   
               940      page-faults:u                    #   93.519 /sec                   
       36983540385      cycles:u                         #    3.679 GHz                      (83.34%)
           3535506      stalled-cycles-frontend:u        #    0.01% frontend cycles idle     (83.33%)
          12252917      stalled-cycles-backend:u         #    0.03% backend cycles idle      (83.34%)
       87650235892      instructions:u                   #    2.37  insn per cycle         
                                                  #    0.00  stalled cycles per insn  (83.34%)
       12504689935      branches:u                       #    1.244 G/sec                    (83.33%)
          12606975      branch-misses:u                  #    0.10% of all branches          (83.32%)

      10.059089949 seconds time elapsed

      10.048218000 seconds user
       0.003998000 seconds sys


jh@alberti:~> gcc matrix-fma.s ; perf stat ./a.out
        mult took   13147631 clocks

 Performance counter stats for './a.out':

          13152.81 msec task-clock:u                     #    0.999 CPUs utilized          
                 0      context-switches:u               #    0.000 /sec                   
                 0      cpu-migrations:u                 #    0.000 /sec                   
               940      page-faults:u                    #   71.468 /sec                   
       48394201333      cycles:u                         #    3.679 GHz                      (83.32%)
           4251637      stalled-cycles-frontend:u        #    0.01% frontend cycles idle     (83.32%)
          13664772      stalled-cycles-backend:u         #    0.03% backend cycles idle      (83.34%)
       75101376364      instructions:u                   #    1.55  insn per cycle         
                                                  #    0.00  stalled cycles per insn  (83.35%)
       12510705466      branches:u                       #  951.182 M/sec                    (83.34%)
          12612898      branch-misses:u                  #    0.10% of all branches          (83.33%)

      13.162186067 seconds time elapsed

      13.153354000 seconds user
       0.000000000 seconds sys

So here I would expect 3*100*1000*1000*1000/8 = 37500000000 cycles for the first
run and 4*100*1000*1000*1000/8 = 50000000000 cycles for the second.
So again the estimate is a small over-statement, apparently due to parallelism
between the vector multiplications, but overall it seems to match what I would
expect to see.

Honza

> 
> > Since this seems a noticeable win on Zen and not a loss on Core, it seems like a
> > good default for generic.
> >
> > I plan to commit the patch next week if there are no complaints.
> 
> complaint!
> 
> Richard.
> 
> > Honza
> >
> > #include <stdio.h>
> > #include <time.h>
> >
> > #define SIZE 1000
> >
> > float a[SIZE][SIZE];
> > float b[SIZE][SIZE];
> > float c[SIZE][SIZE];
> >
> > void init(void)
> > {
> >    int i, j, k;
> >    for(i=0; i<SIZE; ++i)
> >    {
> >       for(j=0; j<SIZE; ++j)
> >       {
> >          a[i][j] = (float)i + j;
> >          b[i][j] = (float)i - j;
> >          c[i][j] = 0.0f;
> >       }
> >    }
> > }
> >
> > void mult(void)
> > {
> >    int i, j, k;
> >
> >    for(i=0; i<SIZE; ++i)
> >    {
> >       for(j=0; j<SIZE; ++j)
> >       {
> >          for(k=0; k<SIZE; ++k)
> >          {
> >             c[i][j] += a[i][k] * b[k][j];
> >          }
> >       }
> >    }
> > }
> >
> > int main(void)
> > {
> >    clock_t s, e;
> >
> >    init();
> >    s=clock();
> >    mult();
> >    e=clock();
> >    printf("        mult took %10d clocks\n", (int)(e-s));
> >
> >    return 0;
> >
> > }
> >
> >         * config/i386/x86-tune.def (X86_TUNE_AVOID_128FMA_CHAINS, X86_TUNE_AVOID_256FMA_CHAINS):
> >         Enable for znver4 and generic.
> >
> > diff --git a/gcc/config/i386/x86-tune.def b/gcc/config/i386/x86-tune.def
> > index 43fa9e8fd6d..74b03cbcc60 100644
> > --- a/gcc/config/i386/x86-tune.def
> > +++ b/gcc/config/i386/x86-tune.def
> > @@ -515,13 +515,13 @@ DEF_TUNE (X86_TUNE_USE_SCATTER_8PARTS, "use_scatter_8parts",
> >
> >  /* X86_TUNE_AVOID_128FMA_CHAINS: Avoid creating loops with tight 128bit or
> >     smaller FMA chain.  */
> > -DEF_TUNE (X86_TUNE_AVOID_128FMA_CHAINS, "avoid_fma_chains", m_ZNVER1 | m_ZNVER2 | m_ZNVER3
> > -          | m_YONGFENG)
> > +DEF_TUNE (X86_TUNE_AVOID_128FMA_CHAINS, "avoid_fma_chains", m_ZNVER1 | m_ZNVER2 | m_ZNVER3 | m_ZNVER4
> > +          | m_YONGFENG | m_GENERIC)
> >
> >  /* X86_TUNE_AVOID_256FMA_CHAINS: Avoid creating loops with tight 256bit or
> >     smaller FMA chain.  */
> > -DEF_TUNE (X86_TUNE_AVOID_256FMA_CHAINS, "avoid_fma256_chains", m_ZNVER2 | m_ZNVER3
> > -         | m_CORE_HYBRID | m_SAPPHIRERAPIDS | m_CORE_ATOM)
> > +DEF_TUNE (X86_TUNE_AVOID_256FMA_CHAINS, "avoid_fma256_chains", m_ZNVER2 | m_ZNVER3 | m_ZNVER4
> > +         | m_CORE_HYBRID | m_SAPPHIRERAPIDS | m_CORE_ATOM | m_GENERIC)
> >
> >  /* X86_TUNE_AVOID_512FMA_CHAINS: Avoid creating loops with tight 512bit or
> >     smaller FMA chain.  */
Alexander Monakov Dec. 12, 2023, 5:08 p.m. UTC | #3
On Tue, 12 Dec 2023, Richard Biener wrote:

> On Tue, Dec 12, 2023 at 3:38 PM Jan Hubicka <hubicka@ucw.cz> wrote:
> >
> > Hi,
> > this patch disables use of FMA in the matrix multiplication loop for generic (for
> > x86-64-v3) and zen4.  I tested this on zen4 and Xeon Gold 6212U.
> >
> > For Intel this is neutral both on the matrix multiplication microbenchmark
> > (attached) and spec2k17 where the difference was within noise for Core.
> >
> > On core the micro-benchmark runs as follows:
> >
> > With FMA:
> >
> >        578,500,241      cycles:u                         #    3.645 GHz                         ( +-  0.12% )
> >        753,318,477      instructions:u                   #    1.30  insn per cycle              ( +-  0.00% )
> >        125,417,701      branches:u                       #  790.227 M/sec                       ( +-  0.00% )
> >           0.159146 +- 0.000363 seconds time elapsed  ( +-  0.23% )
> >
> >
> > No FMA:
> >
> >        577,573,960      cycles:u                         #    3.514 GHz                         ( +-  0.15% )
> >        878,318,479      instructions:u                   #    1.52  insn per cycle              ( +-  0.00% )
> >        125,417,702      branches:u                       #  763.035 M/sec                       ( +-  0.00% )
> >           0.164734 +- 0.000321 seconds time elapsed  ( +-  0.19% )
> >
> > So the cycle count is unchanged and discrete multiply+add takes the same time as FMA.
> >
> > While on zen:
> >
> >
> > With FMA:
> >          484875179      cycles:u                         #    3.599 GHz                      ( +-  0.05% )  (82.11%)
> >          752031517      instructions:u                   #    1.55  insn per cycle
> >          125106525      branches:u                       #  928.712 M/sec                    ( +-  0.03% )  (85.09%)
> >             128356      branch-misses:u                  #    0.10% of all branches          ( +-  0.06% )  (83.58%)
> >
> > No FMA:
> >          375875209      cycles:u                         #    3.592 GHz                      ( +-  0.08% )  (80.74%)
> >          875725341      instructions:u                   #    2.33  insn per cycle
> >          124903825      branches:u                       #    1.194 G/sec                    ( +-  0.04% )  (84.59%)
> >           0.105203 +- 0.000188 seconds time elapsed  ( +-  0.18% )
> >
> > The difference is that Core CPUs understand the fact that fmadd does not need
> > all three parameters to start computation, while Zen cores don't.
> 
> This came up in a separate thread as well, but when doing reassoc of a
> chain with multiple dependent FMAs.

> I can't understand how this uarch detail can affect performance when as in
> the testcase the longest input latency is on the multiplication from a
> memory load.

The latency from the memory operand doesn't matter since it's not a part
of the critical path. The memory uop of the FMA starts executing as soon
as the address is ready.

> Do we actually understand _why_ the FMAs are slower here?

It's simple: on Zen4 FMA has latency 4 while add has latency 3, and you
clearly see it in the quoted numbers: zen-with-fma is slightly below 4
cycles per branch, zen-without-fma is exactly 3 cycles per branch.
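
Spelled out from the quoted counters (simple division of the numbers above):

  484875179 cycles / 125106525 branches ≈ 3.88 cycles per iteration (FMA chain, latency 4)
  375875209 cycles / 124903825 branches ≈ 3.01 cycles per iteration (mul+add chain, latency 3)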

Please refer to uops.info for latency data:
https://uops.info/html-instr/VMULPS_YMM_YMM_YMM.html
https://uops.info/html-instr/VFMADD231PS_YMM_YMM_YMM.html

> Do we know that Cores can start the multiplication part when the add
> operand isn't ready yet?  I'm curious how you set up a micro benchmark to
> measure this.

Unlike some of the Arm cores, none of the x86 cores can consume the addend
of an FMA on a later cycle than the multiplicands, with Alder Lake-E
being the sole exception, apparently (see the 6/10/10 latencies on the
aforementioned uops.info FMA page).

> There's one detail on Zen in that it can issue 2 FADDs and 2 FMUL/FMA per
> cycle.  So in theory we can at most do 2 FMA per cycle but with latency
> (FMA) == 4 for Zen3/4 and latency (FADD/FMUL) == 3 we might be able to
> squeeze out a little bit more throughput when there are many FADD/FMUL ops
> to execute?  That works independent on whether FMAs have a head-start on
> multiplication as you'd still be bottle-necked on the 2-wide issue for
> FMA?

It doesn't matter here since all FMAs/FMULs are dependent on each other,
so the processor can start a new FMA only every 4th (or 3rd) cycle, except
when starting a new iteration of the outer loop.

> On Icelake it seems all FADD/FMUL/FMA share ports 0 and 1 and all have a
> latency of four.  So you should get worse results there (looking at the
> numbers above you do get worse results, slightly so), probably the higher
> number of uops is hidden by the latency.

A simple solution would be to enable AVOID_FMA_CHAINS when FMA latency 
exceeds FMUL latency (all Zens and Broadwell).
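
For illustration, such a grouping could be spelled in x86-tune.def roughly like
this (the m_FMA_SLOWER_THAN_FMUL name is invented for this sketch; Broadwell's
tuning mask would be added wherever it lives):

/* Sketch only: uarchs where FMA latency exceeds FMUL latency.  */
#define m_FMA_SLOWER_THAN_FMUL \
  (m_ZNVER1 | m_ZNVER2 | m_ZNVER3 | m_ZNVER4)

DEF_TUNE (X86_TUNE_AVOID_128FMA_CHAINS, "avoid_fma_chains",
	  m_FMA_SLOWER_THAN_FMUL | m_YONGFENG | m_GENERIC)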

> > Since this seems a noticeable win on Zen and not a loss on Core, it seems like a
> > good default for generic.
> >
> > I plan to commit the patch next week if there are no complaints.
> 
> complaint!

Thanks for raising this, hopefully my explanation clears it up.

Alexander
Hongtao Liu Dec. 12, 2023, 11:56 p.m. UTC | #4
On Tue, Dec 12, 2023 at 10:38 PM Jan Hubicka <hubicka@ucw.cz> wrote:
>
> Hi,
> this patch disables use of FMA in the matrix multiplication loop for generic (for
> x86-64-v3) and zen4.  I tested this on zen4 and Xeon Gold 6212U.
>
> For Intel this is neutral both on the matrix multiplication microbenchmark
> (attached) and spec2k17 where the difference was within noise for Core.
>
> On core the micro-benchmark runs as follows:
>
> With FMA:
>
>        578,500,241      cycles:u                         #    3.645 GHz                         ( +-  0.12% )
>        753,318,477      instructions:u                   #    1.30  insn per cycle              ( +-  0.00% )
>        125,417,701      branches:u                       #  790.227 M/sec                       ( +-  0.00% )
>           0.159146 +- 0.000363 seconds time elapsed  ( +-  0.23% )
>
>
> No FMA:
>
>        577,573,960      cycles:u                         #    3.514 GHz                         ( +-  0.15% )
>        878,318,479      instructions:u                   #    1.52  insn per cycle              ( +-  0.00% )
>        125,417,702      branches:u                       #  763.035 M/sec                       ( +-  0.00% )
>           0.164734 +- 0.000321 seconds time elapsed  ( +-  0.19% )
>
> So the cycle count is unchanged and discrete multiply+add takes the same time as FMA.
>
> While on zen:
>
>
> With FMA:
>          484875179      cycles:u                         #    3.599 GHz                      ( +-  0.05% )  (82.11%)
>          752031517      instructions:u                   #    1.55  insn per cycle
>          125106525      branches:u                       #  928.712 M/sec                    ( +-  0.03% )  (85.09%)
>             128356      branch-misses:u                  #    0.10% of all branches          ( +-  0.06% )  (83.58%)
>
> No FMA:
>          375875209      cycles:u                         #    3.592 GHz                      ( +-  0.08% )  (80.74%)
>          875725341      instructions:u                   #    2.33  insn per cycle
>          124903825      branches:u                       #    1.194 G/sec                    ( +-  0.04% )  (84.59%)
>           0.105203 +- 0.000188 seconds time elapsed  ( +-  0.18% )
>
> The difference is that Core CPUs understand the fact that fmadd does not need
> all three parameters to start computation, while Zen cores don't.
>
> Since this seems a noticeable win on Zen and not a loss on Core, it seems like a
> good default for generic.
>
> I plan to commit the patch next week if there are no complaints.
The generic part LGTM (it's exactly what we proposed in [1]).

[1] https://gcc.gnu.org/pipermail/gcc-patches/2023-November/637721.html
>
> Honza
>
> #include <stdio.h>
> #include <time.h>
>
> #define SIZE 1000
>
> float a[SIZE][SIZE];
> float b[SIZE][SIZE];
> float c[SIZE][SIZE];
>
> void init(void)
> {
>    int i, j, k;
>    for(i=0; i<SIZE; ++i)
>    {
>       for(j=0; j<SIZE; ++j)
>       {
>          a[i][j] = (float)i + j;
>          b[i][j] = (float)i - j;
>          c[i][j] = 0.0f;
>       }
>    }
> }
>
> void mult(void)
> {
>    int i, j, k;
>
>    for(i=0; i<SIZE; ++i)
>    {
>       for(j=0; j<SIZE; ++j)
>       {
>          for(k=0; k<SIZE; ++k)
>          {
>             c[i][j] += a[i][k] * b[k][j];
>          }
>       }
>    }
> }
>
> int main(void)
> {
>    clock_t s, e;
>
>    init();
>    s=clock();
>    mult();
>    e=clock();
>    printf("        mult took %10d clocks\n", (int)(e-s));
>
>    return 0;
>
> }
>
>         * config/i386/x86-tune.def (X86_TUNE_AVOID_128FMA_CHAINS, X86_TUNE_AVOID_256FMA_CHAINS):
>         Enable for znver4 and generic.
>
> diff --git a/gcc/config/i386/x86-tune.def b/gcc/config/i386/x86-tune.def
> index 43fa9e8fd6d..74b03cbcc60 100644
> --- a/gcc/config/i386/x86-tune.def
> +++ b/gcc/config/i386/x86-tune.def
> @@ -515,13 +515,13 @@ DEF_TUNE (X86_TUNE_USE_SCATTER_8PARTS, "use_scatter_8parts",
>
>  /* X86_TUNE_AVOID_128FMA_CHAINS: Avoid creating loops with tight 128bit or
>     smaller FMA chain.  */
> -DEF_TUNE (X86_TUNE_AVOID_128FMA_CHAINS, "avoid_fma_chains", m_ZNVER1 | m_ZNVER2 | m_ZNVER3
> -          | m_YONGFENG)
> +DEF_TUNE (X86_TUNE_AVOID_128FMA_CHAINS, "avoid_fma_chains", m_ZNVER1 | m_ZNVER2 | m_ZNVER3 | m_ZNVER4
> +          | m_YONGFENG | m_GENERIC)
>
>  /* X86_TUNE_AVOID_256FMA_CHAINS: Avoid creating loops with tight 256bit or
>     smaller FMA chain.  */
> -DEF_TUNE (X86_TUNE_AVOID_256FMA_CHAINS, "avoid_fma256_chains", m_ZNVER2 | m_ZNVER3
> -         | m_CORE_HYBRID | m_SAPPHIRERAPIDS | m_CORE_ATOM)
> +DEF_TUNE (X86_TUNE_AVOID_256FMA_CHAINS, "avoid_fma256_chains", m_ZNVER2 | m_ZNVER3 | m_ZNVER4
> +         | m_CORE_HYBRID | m_SAPPHIRERAPIDS | m_CORE_ATOM | m_GENERIC)
>
>  /* X86_TUNE_AVOID_512FMA_CHAINS: Avoid creating loops with tight 512bit or
>     smaller FMA chain.  */
Jan Hubicka Dec. 13, 2023, 4:03 p.m. UTC | #5
> > The difference is that Core CPUs understand the fact that fmadd does not need
> > all three parameters to start computation, while Zen cores don't.
> >
> > Since this seems a noticeable win on Zen and not a loss on Core, it seems like a
> > good default for generic.
> >
> > I plan to commit the patch next week if there are no complaints.
> The generic part LGTM.(It's exactly what we proposed in [1])
> 
> [1] https://gcc.gnu.org/pipermail/gcc-patches/2023-November/637721.html

Thanks.  I wonder if we can think of other generic changes that would make
sense to do?
Concerning Zen4 and FMA, it is not really a win with AVX512 enabled
(which is what I was benchmarking for znver4 tuning), but it is indeed a
win with AVX256, where the extra latency is not hidden by the parallelism
exposed by doing everything twice.

I re-benchmarked Zen4 and it behaves similarly to Zen3 with AVX256, so
for x86-64-v3 this makes sense.

Honza
> >
> > Honza
> >
> > #include <stdio.h>
> > #include <time.h>
> >
> > #define SIZE 1000
> >
> > float a[SIZE][SIZE];
> > float b[SIZE][SIZE];
> > float c[SIZE][SIZE];
> >
> > void init(void)
> > {
> >    int i, j, k;
> >    for(i=0; i<SIZE; ++i)
> >    {
> >       for(j=0; j<SIZE; ++j)
> >       {
> >          a[i][j] = (float)i + j;
> >          b[i][j] = (float)i - j;
> >          c[i][j] = 0.0f;
> >       }
> >    }
> > }
> >
> > void mult(void)
> > {
> >    int i, j, k;
> >
> >    for(i=0; i<SIZE; ++i)
> >    {
> >       for(j=0; j<SIZE; ++j)
> >       {
> >          for(k=0; k<SIZE; ++k)
> >          {
> >             c[i][j] += a[i][k] * b[k][j];
> >          }
> >       }
> >    }
> > }
> >
> > int main(void)
> > {
> >    clock_t s, e;
> >
> >    init();
> >    s=clock();
> >    mult();
> >    e=clock();
> >    printf("        mult took %10d clocks\n", (int)(e-s));
> >
> >    return 0;
> >
> > }
> >
> >         * config/i386/x86-tune.def (X86_TUNE_AVOID_128FMA_CHAINS, X86_TUNE_AVOID_256FMA_CHAINS):
> >         Enable for znver4 and generic.
> >
> > diff --git a/gcc/config/i386/x86-tune.def b/gcc/config/i386/x86-tune.def
> > index 43fa9e8fd6d..74b03cbcc60 100644
> > --- a/gcc/config/i386/x86-tune.def
> > +++ b/gcc/config/i386/x86-tune.def
> > @@ -515,13 +515,13 @@ DEF_TUNE (X86_TUNE_USE_SCATTER_8PARTS, "use_scatter_8parts",
> >
> >  /* X86_TUNE_AVOID_128FMA_CHAINS: Avoid creating loops with tight 128bit or
> >     smaller FMA chain.  */
> > -DEF_TUNE (X86_TUNE_AVOID_128FMA_CHAINS, "avoid_fma_chains", m_ZNVER1 | m_ZNVER2 | m_ZNVER3
> > -          | m_YONGFENG)
> > +DEF_TUNE (X86_TUNE_AVOID_128FMA_CHAINS, "avoid_fma_chains", m_ZNVER1 | m_ZNVER2 | m_ZNVER3 | m_ZNVER4
> > +          | m_YONGFENG | m_GENERIC)
> >
> >  /* X86_TUNE_AVOID_256FMA_CHAINS: Avoid creating loops with tight 256bit or
> >     smaller FMA chain.  */
> > -DEF_TUNE (X86_TUNE_AVOID_256FMA_CHAINS, "avoid_fma256_chains", m_ZNVER2 | m_ZNVER3
> > -         | m_CORE_HYBRID | m_SAPPHIRERAPIDS | m_CORE_ATOM)
> > +DEF_TUNE (X86_TUNE_AVOID_256FMA_CHAINS, "avoid_fma256_chains", m_ZNVER2 | m_ZNVER3 | m_ZNVER4
> > +         | m_CORE_HYBRID | m_SAPPHIRERAPIDS | m_CORE_ATOM | m_GENERIC)
> >
> >  /* X86_TUNE_AVOID_512FMA_CHAINS: Avoid creating loops with tight 512bit or
> >     smaller FMA chain.  */
> 
> 
> 
> -- 
> BR,
> Hongtao
Hongtao Liu Jan. 8, 2024, 3:16 a.m. UTC | #6
On Thu, Dec 14, 2023 at 12:03 AM Jan Hubicka <hubicka@ucw.cz> wrote:
>
> > > The difference is that Core CPUs understand the fact that fmadd does not need
> > > all three parameters to start computation, while Zen cores don't.
> > >
> > > Since this seems a noticeable win on Zen and not a loss on Core, it seems like a
> > > good default for generic.
> > >
> > > I plan to commit the patch next week if there are no complaints.
> > The generic part LGTM.(It's exactly what we proposed in [1])
> >
> > [1] https://gcc.gnu.org/pipermail/gcc-patches/2023-November/637721.html
>
> Thanks.  I wonder if can think of other generic changes that would make
> sense to do?
> Concerning zen4 and FMA, it is not really win with AVX512 enabled
> (which is what I was benchmarking for znver4 tuning), but indeed it is
> win with AVX256 where the extra latency is not hidden by the parallelism
> exposed by doing evertyhing twice.
>
> I re-benmchmarked zen4 and it behaves similarly to zen3 with avx256, so
> for x86-64-v3 this makes sense.
>
> Honza
> > >
> > > Honza
> > >
> > > #include <stdio.h>
> > > #include <time.h>
> > >
> > > #define SIZE 1000
> > >
> > > float a[SIZE][SIZE];
> > > float b[SIZE][SIZE];
> > > float c[SIZE][SIZE];
> > >
> > > void init(void)
> > > {
> > >    int i, j, k;
> > >    for(i=0; i<SIZE; ++i)
> > >    {
> > >       for(j=0; j<SIZE; ++j)
> > >       {
> > >          a[i][j] = (float)i + j;
> > >          b[i][j] = (float)i - j;
> > >          c[i][j] = 0.0f;
> > >       }
> > >    }
> > > }
> > >
> > > void mult(void)
> > > {
> > >    int i, j, k;
> > >
> > >    for(i=0; i<SIZE; ++i)
> > >    {
> > >       for(j=0; j<SIZE; ++j)
> > >       {
> > >          for(k=0; k<SIZE; ++k)
> > >          {
> > >             c[i][j] += a[i][k] * b[k][j];
> > >          }
> > >       }
> > >    }
> > > }
> > >
> > > int main(void)
> > > {
> > >    clock_t s, e;
> > >
> > >    init();
> > >    s=clock();
> > >    mult();
> > >    e=clock();
> > >    printf("        mult took %10d clocks\n", (int)(e-s));
> > >
> > >    return 0;
> > >
> > > }
> > >
> > >         * config/i386/x86-tune.def (X86_TUNE_AVOID_128FMA_CHAINS, X86_TUNE_AVOID_256FMA_CHAINS):
> > >         Enable for znver4 and generic.
> > >
> > > diff --git a/gcc/config/i386/x86-tune.def b/gcc/config/i386/x86-tune.def
> > > index 43fa9e8fd6d..74b03cbcc60 100644
> > > --- a/gcc/config/i386/x86-tune.def
> > > +++ b/gcc/config/i386/x86-tune.def
> > > @@ -515,13 +515,13 @@ DEF_TUNE (X86_TUNE_USE_SCATTER_8PARTS, "use_scatter_8parts",
> > >
> > >  /* X86_TUNE_AVOID_128FMA_CHAINS: Avoid creating loops with tight 128bit or
> > >     smaller FMA chain.  */
> > > -DEF_TUNE (X86_TUNE_AVOID_128FMA_CHAINS, "avoid_fma_chains", m_ZNVER1 | m_ZNVER2 | m_ZNVER3
> > > -          | m_YONGFENG)
> > > +DEF_TUNE (X86_TUNE_AVOID_128FMA_CHAINS, "avoid_fma_chains", m_ZNVER1 | m_ZNVER2 | m_ZNVER3 | m_ZNVER4
> > > +          | m_YONGFENG | m_GENERIC)
> > >
> > >  /* X86_TUNE_AVOID_256FMA_CHAINS: Avoid creating loops with tight 256bit or
> > >     smaller FMA chain.  */
> > > -DEF_TUNE (X86_TUNE_AVOID_256FMA_CHAINS, "avoid_fma256_chains", m_ZNVER2 | m_ZNVER3
> > > -         | m_CORE_HYBRID | m_SAPPHIRERAPIDS | m_CORE_ATOM)
> > > +DEF_TUNE (X86_TUNE_AVOID_256FMA_CHAINS, "avoid_fma256_chains", m_ZNVER2 | m_ZNVER3 | m_ZNVER4
> > > +         | m_CORE_HYBRID | m_SAPPHIRERAPIDS | m_CORE_ATOM | m_GENERIC)
Can we backport the patch (at least the generic part) to the
GCC 11/12/13 release branches?
> > >
> > >  /* X86_TUNE_AVOID_512FMA_CHAINS: Avoid creating loops with tight 512bit or
> > >     smaller FMA chain.  */
> >
> >
> >
> > --
> > BR,
> > Hongtao
Jan Hubicka Jan. 17, 2024, 5:29 p.m. UTC | #7
> Can we backport the patch (at least the generic part) to the
> GCC 11/12/13 release branches?

Yes, the periodic testers have picked up the change and, as far as I can tell,
there are no surprises.

Thanks,
Honza
> > > >
> > > >  /* X86_TUNE_AVOID_512FMA_CHAINS: Avoid creating loops with tight 512bit or
> > > >     smaller FMA chain.  */
> > >
> > >
> > >
> > > --
> > > BR,
> > > Hongtao
> 
> 
> 
> -- 
> BR,
> Hongtao

Patch

diff --git a/gcc/config/i386/x86-tune.def b/gcc/config/i386/x86-tune.def
index 43fa9e8fd6d..74b03cbcc60 100644
--- a/gcc/config/i386/x86-tune.def
+++ b/gcc/config/i386/x86-tune.def
@@ -515,13 +515,13 @@  DEF_TUNE (X86_TUNE_USE_SCATTER_8PARTS, "use_scatter_8parts",
 
 /* X86_TUNE_AVOID_128FMA_CHAINS: Avoid creating loops with tight 128bit or
    smaller FMA chain.  */
-DEF_TUNE (X86_TUNE_AVOID_128FMA_CHAINS, "avoid_fma_chains", m_ZNVER1 | m_ZNVER2 | m_ZNVER3
-          | m_YONGFENG)
+DEF_TUNE (X86_TUNE_AVOID_128FMA_CHAINS, "avoid_fma_chains", m_ZNVER1 | m_ZNVER2 | m_ZNVER3 | m_ZNVER4
+          | m_YONGFENG | m_GENERIC)
 
 /* X86_TUNE_AVOID_256FMA_CHAINS: Avoid creating loops with tight 256bit or
    smaller FMA chain.  */
-DEF_TUNE (X86_TUNE_AVOID_256FMA_CHAINS, "avoid_fma256_chains", m_ZNVER2 | m_ZNVER3
-	  | m_CORE_HYBRID | m_SAPPHIRERAPIDS | m_CORE_ATOM)
+DEF_TUNE (X86_TUNE_AVOID_256FMA_CHAINS, "avoid_fma256_chains", m_ZNVER2 | m_ZNVER3 | m_ZNVER4
+	  | m_CORE_HYBRID | m_SAPPHIRERAPIDS | m_CORE_ATOM | m_GENERIC)
 
 /* X86_TUNE_AVOID_512FMA_CHAINS: Avoid creating loops with tight 512bit or
    smaller FMA chain.  */