[libfortran] Add AVX-specific matmul

Message ID 05fbb04a-f4c1-cb61-9baa-7a86ea673784@netcologne.de
State New

Commit Message

Thomas Koenig Nov. 16, 2016, 9:30 p.m. UTC
Hello world,

the attached patch adds an AVX-specific version of the matmul
intrinsic to the Fortran library.  This works by using the target_clones
attribute.
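
For illustration, here is a minimal stand-alone use of the attribute
(a sketch with hypothetical names, not part of the patch).  GCC emits
one clone per listed target plus a resolver, and the dynamic linker
picks the right clone at load time:

#include <stdio.h>

/* "default" must always be part of the clone list.  */
__attribute__ ((target_clones ("avx,default")))
static void
axpy (double *restrict y, const double *restrict x, double a, int n)
{
  for (int i = 0; i < n; i++)
    y[i] += a * x[i];
}

int
main (void)
{
  double x[4] = { 1.0, 2.0, 3.0, 4.0 };
  double y[4] = { 0.0 };

  axpy (y, x, 2.0, 4);
  printf ("%g\n", y[3]);  /* prints 8 */
  return 0;
}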

For testing, I compiled this on powerpc64-unknown-linux-gnu,
without any ill effects.

Also, a resulting binary reached around 15 GFlops for larger matrices
on a 3.4 GHz i7-2600 CPU.  I am currently building/regtesting on
that machine. This can give another 40% speed increase for large
matrices on AVX.

OK for trunk?

Regards

	Thomas

2016-11-16  Thomas Koenig  <tkoenig@gcc.gnu.org>

         PR fortran/78379
         * m4/matmul.m4:  For x86_64, make the work function for matmul
         static with target_clones for AVX and default, and create
         a wrapper function to call it.
         * generated/matmul_c10.c
         * generated/matmul_c16.c: Regenerated.
         * generated/matmul_c4.c: Regenerated.
         * generated/matmul_c8.c: Regenerated.
         * generated/matmul_i1.c: Regenerated.
         * generated/matmul_i16.c: Regenerated.
         * generated/matmul_i2.c: Regenerated.
         * generated/matmul_i4.c: Regenerated.
         * generated/matmul_i8.c: Regenerated.
         * generated/matmul_r10.c: Regenerated.
         * generated/matmul_r16.c: Regenerated.
         * generated/matmul_r4.c: Regenerated.
         * generated/matmul_r8.c: Regenerated.

Comments

Jakub Jelinek Nov. 16, 2016, 10:01 p.m. UTC | #1
On Wed, Nov 16, 2016 at 10:30:03PM +0100, Thomas Koenig wrote:
> the attached patch adds an AVX-specific version of the matmul
> intrinsic to the Fortran library.  This works by using the target_clones
> attribute.

Don't you need to test in configure if the assembler supports AVX?
Otherwise if somebody is bootstrapping gcc with older assembler, it will
just fail to bootstrap.
For matmul_i*, wouldn't it make more sense to use avx2 instead of avx,
or both avx and avx2 and maybe avx512f?

> 2016-11-16  Thomas Koenig  <tkoenig@gcc.gnu.org>
> 
>         PR fortran/78379
>         * m4/matmul.m4:  For x86_64, make the work function for matmul

Why the extra space before For?

>         static with target_clones for AVX and default, and create
>         a wrapper function to call it.
>         * generated/matmul_c10.c

Missing : Regenerated.

	Jakub
Thomas Koenig Nov. 16, 2016, 11:03 p.m. UTC | #2
Am 16.11.2016 um 23:01 schrieb Jakub Jelinek:
> On Wed, Nov 16, 2016 at 10:30:03PM +0100, Thomas Koenig wrote:
>> the attached patch adds an AVX-specific version of the matmul
>> intrinsic to the Fortran library.  This works by using the target_clones
>> attribute.
>
> Don't you need to test in configure if the assembler supports AVX?
> Otherwise if somebody is bootstrapping gcc with older assembler, it will
> just fail to bootstrap.

That's a good point.  The AVX instructions were added in binutils 2.19,
which was released in 2011. This could be put in the prerequisites.

What should the test do?  Fail with an error message "you need newer
binutils" or simply (and silently) not compile the AVX vesion?

> For matmul_i*, wouldn't it make more sense to use avx2 instead of avx,
> or both avx and avx2 and maybe avx512f?

I did a vdiff of the disassembled code generated for avx and avx2, and
(somewhat to my surprise) there was no difference.  Maybe, with more
unrolling, something more might have happened. I didn't check for
AVX512f, but I can do that.

>> 2016-11-16  Thomas Koenig  <tkoenig@gcc.gnu.org>
>>
>>         PR fortran/78379
>>         * m4/matmul.m4:  For x86_64, make the work function for matmul
>
> Why the extra space before For?

Will be removed.

>>         static with target_clones for AVX and default, and create
>>         a wrapper function to call it.
>>         * generated/matmul_c10.c
>
> Missing : Regenerated.

Will be added.

Regards

	Thomas
Jerry DeLisle Nov. 16, 2016, 11:06 p.m. UTC | #3
On 11/16/2016 01:30 PM, Thomas Koenig wrote:
> Hello world,
>
> the attached patch adds an AVX-specific version of the matmul
> intrinsic to the Fortran library.  This works by using the target_clones
> attribute.
>
> For testing, I compiled this on powerpc64-unknown-linux-gnu,
> without any ill effects.
>
> Also, a resulting binary reached around 15 GFlops for larger matrices
> on a 3.4 GHz i7-2600 CPU.  I am currently building/regtesting on
> that machine. This can give another 40% speed increase for large
> matrices on AVX.
>
> OK for trunk?
>

Did you intend to name it avx_matmul and not aux_matmul?

Are the compiler flags for avx handled automatically by the gcc attributes, so 
there is no need to edit the Makefile.am?

Fix the first, and if the answer to the second question is yes, OK.

Jerry
Jakub Jelinek Nov. 16, 2016, 11:20 p.m. UTC | #4
On Thu, Nov 17, 2016 at 12:03:18AM +0100, Thomas Koenig wrote:
> >Don't you need to test in configure if the assembler supports AVX?
> >Otherwise if somebody is bootstrapping gcc with older assembler, it will
> >just fail to bootstrap.
> 
> That's a good point.  The AVX instructions were added in binutils 2.19,
> which was released in 2011. This could be put in the prerequisites.
> 
> What should the test do?  Fail with an error message "you need newer
> binutils" or simply (and silently) not compile the AVX vesion?

From what I understood, you want those functions just to be implementation
details, not exported from libgfortran.so*.  Thus the test would do
something similar to what gcc/testsuite/lib/target-supports.exp (check_effective_target_avx)
does, but of course in autoconf way, not in tcl.
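
The tcl check just compiles a tiny AVX function; in autoconf, a probe
along these lines (a sketch modeled on check_effective_target_avx, not
the actual configure.ac change) wrapped in AC_COMPILE_IFELSE with -mavx
added to the flags would exercise both the compiler and the assembler:

/* If the toolchain handles AVX, this compiles and assembles.  */
void
probe_avx (void)
{
  __builtin_ia32_vzeroall ();
}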
Also, from what I see, target_clones just uses IFUNCs, so you probably also
need some configure test whether ifuncs are supported (the
gcc.target/i386/mvc* tests use dg-require-ifunc, so you'd need something
similar again in configure).  But if so, then I have no idea why you use
a wrapper around the function, instead of using it on the exported APIs.

> >For matmul_i*, wouldn't it make more sense to use avx2 instead of avx,
> >or both avx and avx2 and maybe avx512f?
> 
> I did a vdiff of the disassembled code generated for avx and avx2, and
> (somewhat to my surprise) there was no difference.  Maybe, with more
> unrolling, something more might have happened. I didn't check for
> AVX512f, but I can do that.

For the float/double code it wouldn't surprise me (assuming you don't need
gather insns and similar stuff).  But for integers generally most of the
avx instructions can only handle 128-bit vectors, while avx2 has 256-bit
ones.
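
A one-line kernel makes the difference visible (an illustration,
assuming GCC's autovectorizer kicks in at -O3; compare the assembly
from -O3 -mavx against -O3 -mavx2): the loop below stays on 128-bit
xmm registers with -mavx, but moves to 256-bit ymm vpaddd with -mavx2.

void
vadd (int *restrict c, const int *restrict a,
      const int *restrict b, int n)
{
  for (int i = 0; i < n; i++)
    c[i] = a[i] + b[i];
}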

	Jakub
Thomas Koenig Nov. 17, 2016, 7:41 a.m. UTC | #5
Am 17.11.2016 um 00:20 schrieb Jakub Jelinek:
> On Thu, Nov 17, 2016 at 12:03:18AM +0100, Thomas Koenig wrote:
>>> Don't you need to test in configure if the assembler supports AVX?
>>> Otherwise if somebody is bootstrapping gcc with older assembler, it will
>>> just fail to bootstrap.
>>
>> That's a good point.  The AVX instructions were added in binutils 2.19,
>> which was released in 2011. This could be put in the prerequisites.
>>
>> What should the test do?  Fail with an error message "you need newer
>> binutils" or simply (and silently) not compile the AVX vesion?
>
> From what I understood, you want those functions just to be implementation
> details, not exported from libgfortran.so*.  Thus the test would do
> something similar to what gcc/testsuite/lib/target-supports.exp (check_effective_target_avx)
> does, but of course in autoconf way, not in tcl.

OK, that looks straightforward enough. I'll give it a shot.

> Also, from what I see, target_clones just uses IFUNCs, so you probably also
> need some configure test whether ifuncs are supported (the
> gcc.target/i386/mvc* tests use dg-require-ifunc, so you'd need something
> similar again in configure).  But if so, then I have no idea why you use
> a wrapper around the function, instead of using it on the exported APIs.

As you wrote above, I wanted this as an implementation detail. I also
wanted to be able to add new instruction sets without
breaking the ABI.

Because the caller generates the ifunc, using a wrapper function seemed
like the best way to do it.  The overhead is negligible (the function
is one simple jump), especially considering that we only call the
library function for larger matrices.

>>> For matmul_i*, wouldn't it make more sense to use avx2 instead of avx,
>>> or both avx and avx2 and maybe avx512f?
>>
>> I did a vdiff of the disassembled code generated for avx and avx2, and
>> (somewhat to my surprise) there was no difference.  Maybe, with more
>> unrolling, something more might have happened. I didn't check for
>> AVX512f, but I can do that.
>
> For the float/double code it wouldn't surprise me (assuming you don't need
> gather insns and similar stuff).  But for integers generally most of the
> avx instructions can only handle 128-bit vectors, while avx2 has 256-bit
> ones.

You're right - integer multiplication looks different.

Nobody I know cares about integer matrix multiplication
speed, whereas real has gotten a _lot_ of attention over
the decades.  So, putting in AVX will make the code run
faster on more machines, while putting in AVX2 will
(IMHO) bloat the library for no good reason.  However,
I am willing to stand corrected on this. Putting in AVX512f
makes sense.

I have also been trying to get target_clones to work on POWER
to get Altivec instructions, but to no avail. I also cannot
find any examples in the testsuite.

Since a lot of supercomputers use POWER nodes, that might also
be attractive.

Regards

	Thomas
Janne Blomqvist Nov. 17, 2016, 7:57 a.m. UTC | #6
On Thu, Nov 17, 2016 at 9:41 AM, Thomas Koenig <tkoenig@netcologne.de> wrote:
> Am 17.11.2016 um 00:20 schrieb Jakub Jelinek:
>>
>> On Thu, Nov 17, 2016 at 12:03:18AM +0100, Thomas Koenig wrote:
>>>>
>>>> Don't you need to test in configure if the assembler supports AVX?
>>>> Otherwise if somebody is bootstrapping gcc with older assembler, it will
>>>> just fail to bootstrap.
>>>
>>>
>>> That's a good point.  The AVX instructions were added in binutils 2.19,
>>> which was released in 2011. This could be put in the prerequisites.
>>>
>>> What should the test do?  Fail with an error message "you need newer
>>> binutils" or simply (and silently) not compile the AVX vesion?
>>
>> From what I understood, you want those functions just to be implementation
>> details, not exported from libgfortran.so*.  Thus the test would do
>> something similar to what gcc/testsuite/lib/target-supports.exp
>> (check_effective_target_avx)
>> does, but of course in autoconf way, not in tcl.
>
>
> OK, that looks straightforward enough. I'll give it a shot.
>
>> Also, from what I see, target_clones just uses IFUNCs, so you probably also
>> need some configure test whether ifuncs are supported (the
>> gcc.target/i386/mvc* tests use dg-require-ifunc, so you'd need something
>> similar again in configure).  But if so, then I have no idea why you use
>> a wrapper around the function, instead of using it on the exported APIs.
>
>
> As you wrote above, I wanted this as an implementation detail. I also
> wanted to be able to add new instruction sets without
> breaking the ABI.
>
> Because the caller generates the ifunc, using a wrapper function seemed
> like the best way to do it.  The overhead is negligible (the function
> is one simple jump), especially considering that we only call the
> library function for larger matrices.
>
>>>> For matmul_i*, wouldn't it make more sense to use avx2 instead of avx,
>>>> or both avx and avx2 and maybe avx512f?
>>>
>>>
>>> I did a vdiff of the disassembled code generated for avx and avx2, and
>>> (somewhat to my surprise) there was no difference.  Maybe, with more
>>> unrolling, something more might have happened. I didn't check for
>>> AVX512f, but I can do that.
>>
>>
>> For the float/double code it wouldn't surprise me (assuming you don't need
>> gather insns and similar stuff).  But for integers generally most of the
>> avx instructions can only handle 128-bit vectors, while avx2 has 256-bit
>> ones.
>
>
> You're right - integer multiplication looks different.
>
> Nobody I know cares about integer matrix multiplication
> speed, whereas real has gotten a _lot_ of attention over
> the decades.  So, putting in AVX will make the code run
> faster on more machines, while putting in AVX2 will
> (IMHO) bloat the library for no good reason.  However,
> I am willing to stand corrected on this. Putting in AVX512f
> makes sense.
>
> I have also been trying to get target_clones to work on POWER
> to get Altivec instructions, but to no avail. I also cannot
> find any examples in the testsuite.
>
> Since a lot of supercomputers use POWER nodes, that might also
> be attractive.
>
> Regards
>
>         Thomas

Hi,

In order to reduce bloat, might it make sense to make the core blocked
gemm algorithm that Jerry committed a few days ago into a separate
static function, and then only do the target_clone stuff for that one?
The rest of the matmul function deals with all kinds of stuff like
setup, handling non-stride-1 cases, calling the external gemm function
for -fexternal-blas etc., none of which vectorizes anyway, so
generating different versions of that code with different vector
instructions looks like a waste.
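
In outline, the split would look like this (a sketch with hypothetical
names; a naive triple loop stands in for the blocked kernel):

/* Only the hot kernel is cloned; the wrapper with the setup and
   dispatch logic is compiled exactly once.  */
__attribute__ ((target_clones ("avx,avx2,default")))
static void
matmul_kernel (double *restrict c, const double *restrict a,
	       const double *restrict b, int n)
{
  for (int j = 0; j < n; j++)
    for (int k = 0; k < n; k++)
      for (int i = 0; i < n; i++)
	c[i + j * n] += a[i + k * n] * b[k + j * n];
}

void
matmul_wrapper (double *restrict c, const double *restrict a,
		const double *restrict b, int n)
{
  /* Argument checking, non-stride-1 handling, -fexternal-blas
     dispatch etc. would go here, unduplicated.  */
  matmul_kernel (c, a, b, n);
}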

In that case I guess one could add the avx2 variant as well on the odd
chance that somebody for some reason cares about integer matmul.
Jakub Jelinek Nov. 17, 2016, 4:22 p.m. UTC | #7
On Thu, Nov 17, 2016 at 08:41:48AM +0100, Thomas Koenig wrote:
> Am 17.11.2016 um 00:20 schrieb Jakub Jelinek:
> >On Thu, Nov 17, 2016 at 12:03:18AM +0100, Thomas Koenig wrote:
> >>>Don't you need to test in configure if the assembler supports AVX?
> >>>Otherwise if somebody is bootstrapping gcc with older assembler, it will
> >>>just fail to bootstrap.
> >>
> >>That's a good point.  The AVX instructions were added in binutils 2.19,
> >>which was released in 2011. This could be put in the prerequisites.
> >>
> >>What should the test do?  Fail with an error message "you need newer
> >>binutils" or simply (and silently) not compile the AVX vesion?
> >
> >From what I understood, you want those functions just to be implementation
> >details, not exported from libgfortran.so*.  Thus the test would do
> >something similar to what gcc/testsuite/lib/target-supports.exp (check_effective_target_avx)
> >does, but of course in autoconf way, not in tcl.
> 
> OK, that looks straightforward enough. I'll give it a shot.
> 
> >Also, from what I see, target_clones just uses IFUNCs, so you probably also
> >need some configure test whether ifuncs are supported (the
> >gcc.target/i386/mvc* tests use dg-require-ifunc, so you'd need something
> >similar again in configure).  But if so, then I have no idea why you use
> >a wrapper around the function, instead of using it on the exported APIs.
> 
> As you wrote above, I wanted this as an implementation detail. I also
> wanted to be able to add new instruction sets without
> breaking the ABI.

But even an exported IFUNC is an implementation detail.  To other
libraries/binaries an IFUNC symbol is like any other symbol: they will have
a SHN_UNDEF symbol pointing to it, and it matters only to the dynamic
linker during relocation processing.  Whether some function is IFUNC or not
is not an ABI change; you can change a normal function into an IFUNC, or
vice versa, at any time without breaking the ABI.

> You're right - integer multiplication looks different.
> 
> Nobody I know cares about integer matrix multiplication
> speed, whereas real has gotten a _lot_ of attention over
> the decades.  So, putting in AVX will make the code run
> faster on more machines, while putting in AVX2 will
> (IMHO) bloat the library for no good reason.  However,
> I am willing to stand corrected on this. Putting in AVX512f
> makes sense.

Which is why I've been proposing to use avx2,default for the
matmul_i* files and avx,default for the others.
avx will not buy much for matmul_i*, while avx2 will.

> I have also been trying to get target_clones to work on POWER
> to get Altivec instructions, but to no avail. I also cannot
> find any examples in the testsuite.

Haven't checked, but maybe the target_clones attribute has only been
implemented for x86_64/i686 and not for other targets.
But power supports the target attribute, so you e.g. have the option of
#including the routine multiple times in one TU, each time with a different
name and target attribute, and then writing the IFUNC routine for it by hand.
Or attempt to support target_clones on power, or ask the power maintainers
to do that.
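
Spelled out for x86_64 (a sketch with hypothetical names; on power the
resolver would need a different way to query the CPU):

/* Two builds of one kernel plus a hand-written ifunc resolver.  */
__attribute__ ((target ("avx")))
static void
kernel_avx (double *restrict y, int n)
{
  for (int i = 0; i < n; i++)
    y[i] *= 2.0;
}

static void
kernel_default (double *restrict y, int n)
{
  for (int i = 0; i < n; i++)
    y[i] *= 2.0;
}

/* The resolver runs during relocation processing, before main.  */
static void (*resolve_kernel (void)) (double *restrict, int)
{
  __builtin_cpu_init ();
  return __builtin_cpu_supports ("avx") ? kernel_avx : kernel_default;
}

void kernel (double *restrict, int)
     __attribute__ ((ifunc ("resolve_kernel")));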

	Jakub
Thomas Koenig Nov. 27, 2016, 5:07 p.m. UTC | #8
I wrote:

> As an added bonus, I added some m4 hacks to disable both
> AVX and AVX2 code generation for REAL.

This should have read "I added some m4 hacks to disable
the AVX2 code generation for REAL."

Regards

	Thomas
Jerry DeLisle Nov. 27, 2016, 8:53 p.m. UTC | #9
On 11/27/2016 08:50 AM, Thomas Koenig wrote:
> Hello world,
>
> here is another, much revised, update of the AVX-specific matmul patch.
>
> The processor-specific switching is now done directly, using the

--- snip ---

This comment is not right:

+/* Put exhaustive list of possible architectures here here, ORed together.  */

Performs as expected on my AMD machines. We can still improve peak performance 
on these by about 7%. To clarify, these chips require -mavx -mprefer-avx128. So 
what we need to do is sort out which AMD CPUs need this adjustment with AVX 
registers. (A later patch)

I would like to suggest naming this file matmul_base.m4 rather than
matmul_internal.m4, but that's not critical.

Need a libgcc person for the changes to the cpuinfo items.

The libgfortran portions look OK.

Jerry

Patch

Index: generated/matmul_c10.c
===================================================================
--- generated/matmul_c10.c	(Revision 242477)
+++ generated/matmul_c10.c	(Arbeitskopie)
@@ -75,11 +75,37 @@  extern void matmul_c10 (gfc_array_c10 * const rest
 	int blas_limit, blas_call gemm);
 export_proto(matmul_c10);
 
+#ifdef __x86_64__
+
+/* For x86_64, we switch to AVX if that is available.  For this, we
+   let the actual work be done by the static aux_matmul - function.
+   The user-callable function will then automagically contain the
+   selection code for the right architecture.  This is done to avoid
+   knowledge of architecture details in the front end.  */
+
+static void aux_matmul_c10 (gfc_array_c10 * const restrict retarray, 
+	gfc_array_c10 * const restrict a, gfc_array_c10 * const restrict b, int try_blas,
+	int blas_limit, blas_call gemm)
+	__attribute__ ((target_clones("avx,default")));
+
 void
 matmul_c10 (gfc_array_c10 * const restrict retarray, 
 	gfc_array_c10 * const restrict a, gfc_array_c10 * const restrict b, int try_blas,
 	int blas_limit, blas_call gemm)
 {
+  aux_matmul_c10 (retarray, a, b, try_blas, blas_limit, gemm);
+}
+
+static void
+aux_matmul_c10 (gfc_array_c10 * const restrict retarray, 
+	gfc_array_c10 * const restrict a, gfc_array_c10 * const restrict b, int try_blas,
+	int blas_limit, blas_call gemm)
+#else
+matmul_c10 (gfc_array_c10 * const restrict retarray, 
+	gfc_array_c10 * const restrict a, gfc_array_c10 * const restrict b, int try_blas,
+	int blas_limit, blas_call gemm)
+#endif
+{
   const GFC_COMPLEX_10 * restrict abase;
   const GFC_COMPLEX_10 * restrict bbase;
   GFC_COMPLEX_10 * restrict dest;
Index: generated/matmul_c16.c
===================================================================
--- generated/matmul_c16.c	(Revision 242477)
+++ generated/matmul_c16.c	(Arbeitskopie)
@@ -75,11 +75,37 @@  extern void matmul_c16 (gfc_array_c16 * const rest
 	int blas_limit, blas_call gemm);
 export_proto(matmul_c16);
 
+#ifdef __x86_64__
+
+/* For x86_64, we switch to AVX if that is available.  For this, we
+   let the actual work be done by the static aux_matmul - function.
+   The user-callable function will then automagically contain the
+   selection code for the right architecture.  This is done to avoid
+   knowledge of architecture details in the front end.  */
+
+static void aux_matmul_c16 (gfc_array_c16 * const restrict retarray, 
+	gfc_array_c16 * const restrict a, gfc_array_c16 * const restrict b, int try_blas,
+	int blas_limit, blas_call gemm)
+	__attribute__ ((target_clones("avx,default")));
+
 void
 matmul_c16 (gfc_array_c16 * const restrict retarray, 
 	gfc_array_c16 * const restrict a, gfc_array_c16 * const restrict b, int try_blas,
 	int blas_limit, blas_call gemm)
 {
+  aux_matmul_c16 (retarray, a, b, try_blas, blas_limit, gemm);
+}
+
+static void
+aux_matmul_c16 (gfc_array_c16 * const restrict retarray, 
+	gfc_array_c16 * const restrict a, gfc_array_c16 * const restrict b, int try_blas,
+	int blas_limit, blas_call gemm)
+#else
+matmul_c16 (gfc_array_c16 * const restrict retarray, 
+	gfc_array_c16 * const restrict a, gfc_array_c16 * const restrict b, int try_blas,
+	int blas_limit, blas_call gemm)
+#endif
+{
   const GFC_COMPLEX_16 * restrict abase;
   const GFC_COMPLEX_16 * restrict bbase;
   GFC_COMPLEX_16 * restrict dest;
Index: generated/matmul_c4.c
===================================================================
--- generated/matmul_c4.c	(Revision 242477)
+++ generated/matmul_c4.c	(Arbeitskopie)
@@ -75,11 +75,37 @@  extern void matmul_c4 (gfc_array_c4 * const restri
 	int blas_limit, blas_call gemm);
 export_proto(matmul_c4);
 
+#ifdef __x86_64__
+
+/* For x86_64, we switch to AVX if that is available.  For this, we
+   let the actual work be done by the static aux_matmul - function.
+   The user-callable function will then automagically contain the
+   selection code for the right architecture.  This is done to avoid
+   knowledge of architecture details in the front end.  */
+
+static void aux_matmul_c4 (gfc_array_c4 * const restrict retarray, 
+	gfc_array_c4 * const restrict a, gfc_array_c4 * const restrict b, int try_blas,
+	int blas_limit, blas_call gemm)
+	__attribute__ ((target_clones("avx,default")));
+
 void
 matmul_c4 (gfc_array_c4 * const restrict retarray, 
 	gfc_array_c4 * const restrict a, gfc_array_c4 * const restrict b, int try_blas,
 	int blas_limit, blas_call gemm)
 {
+  aux_matmul_c4 (retarray, a, b, try_blas, blas_limit, gemm);
+}
+
+static void
+aux_matmul_c4 (gfc_array_c4 * const restrict retarray, 
+	gfc_array_c4 * const restrict a, gfc_array_c4 * const restrict b, int try_blas,
+	int blas_limit, blas_call gemm)
+#else
+matmul_c4 (gfc_array_c4 * const restrict retarray, 
+	gfc_array_c4 * const restrict a, gfc_array_c4 * const restrict b, int try_blas,
+	int blas_limit, blas_call gemm)
+#endif
+{
   const GFC_COMPLEX_4 * restrict abase;
   const GFC_COMPLEX_4 * restrict bbase;
   GFC_COMPLEX_4 * restrict dest;
Index: generated/matmul_c8.c
===================================================================
--- generated/matmul_c8.c	(Revision 242477)
+++ generated/matmul_c8.c	(Arbeitskopie)
@@ -75,11 +75,37 @@  extern void matmul_c8 (gfc_array_c8 * const restri
 	int blas_limit, blas_call gemm);
 export_proto(matmul_c8);
 
+#ifdef __x86_64__
+
+/* For x86_64, we switch to AVX if that is available.  For this, we
+   let the actual work be done by the static aux_matmul - function.
+   The user-callable function will then automagically contain the
+   selection code for the right architecture.  This is done to avoid
+   knowledge of architecture details in the front end.  */
+
+static void aux_matmul_c8 (gfc_array_c8 * const restrict retarray, 
+	gfc_array_c8 * const restrict a, gfc_array_c8 * const restrict b, int try_blas,
+	int blas_limit, blas_call gemm)
+	__attribute__ ((target_clones("avx,default")));
+
 void
 matmul_c8 (gfc_array_c8 * const restrict retarray, 
 	gfc_array_c8 * const restrict a, gfc_array_c8 * const restrict b, int try_blas,
 	int blas_limit, blas_call gemm)
 {
+  aux_matmul_c8 (retarray, a, b, try_blas, blas_limit, gemm);
+}
+
+static void
+aux_matmul_c8 (gfc_array_c8 * const restrict retarray, 
+	gfc_array_c8 * const restrict a, gfc_array_c8 * const restrict b, int try_blas,
+	int blas_limit, blas_call gemm)
+#else
+matmul_c8 (gfc_array_c8 * const restrict retarray, 
+	gfc_array_c8 * const restrict a, gfc_array_c8 * const restrict b, int try_blas,
+	int blas_limit, blas_call gemm)
+#endif
+{
   const GFC_COMPLEX_8 * restrict abase;
   const GFC_COMPLEX_8 * restrict bbase;
   GFC_COMPLEX_8 * restrict dest;
Index: generated/matmul_i1.c
===================================================================
--- generated/matmul_i1.c	(Revision 242477)
+++ generated/matmul_i1.c	(Arbeitskopie)
@@ -75,11 +75,37 @@  extern void matmul_i1 (gfc_array_i1 * const restri
 	int blas_limit, blas_call gemm);
 export_proto(matmul_i1);
 
+#ifdef __x86_64__
+
+/* For x86_64, we switch to AVX if that is available.  For this, we
+   let the actual work be done by the static aux_matmul - function.
+   The user-callable function will then automagically contain the
+   selection code for the right architecture.  This is done to avoid
+   knowledge of architecture details in the front end.  */
+
+static void aux_matmul_i1 (gfc_array_i1 * const restrict retarray, 
+	gfc_array_i1 * const restrict a, gfc_array_i1 * const restrict b, int try_blas,
+	int blas_limit, blas_call gemm)
+	__attribute__ ((target_clones("avx,default")));
+
 void
 matmul_i1 (gfc_array_i1 * const restrict retarray, 
 	gfc_array_i1 * const restrict a, gfc_array_i1 * const restrict b, int try_blas,
 	int blas_limit, blas_call gemm)
 {
+  aux_matmul_i1 (retarray, a, b, try_blas, blas_limit, gemm);
+}
+
+static void
+aux_matmul_i1 (gfc_array_i1 * const restrict retarray, 
+	gfc_array_i1 * const restrict a, gfc_array_i1 * const restrict b, int try_blas,
+	int blas_limit, blas_call gemm)
+#else
+matmul_i1 (gfc_array_i1 * const restrict retarray, 
+	gfc_array_i1 * const restrict a, gfc_array_i1 * const restrict b, int try_blas,
+	int blas_limit, blas_call gemm)
+#endif
+{
   const GFC_INTEGER_1 * restrict abase;
   const GFC_INTEGER_1 * restrict bbase;
   GFC_INTEGER_1 * restrict dest;
Index: generated/matmul_i16.c
===================================================================
--- generated/matmul_i16.c	(Revision 242477)
+++ generated/matmul_i16.c	(Arbeitskopie)
@@ -75,11 +75,37 @@  extern void matmul_i16 (gfc_array_i16 * const rest
 	int blas_limit, blas_call gemm);
 export_proto(matmul_i16);
 
+#ifdef __x86_64__
+
+/* For x86_64, we switch to AVX if that is available.  For this, we
+   let the actual work be done by the static aux_matmul - function.
+   The user-callable function will then automagically contain the
+   selection code for the right architecture.  This is done to avoid
+   knowledge of architecture details in the front end.  */
+
+static void aux_matmul_i16 (gfc_array_i16 * const restrict retarray, 
+	gfc_array_i16 * const restrict a, gfc_array_i16 * const restrict b, int try_blas,
+	int blas_limit, blas_call gemm)
+	__attribute__ ((target_clones("avx,default")));
+
 void
 matmul_i16 (gfc_array_i16 * const restrict retarray, 
 	gfc_array_i16 * const restrict a, gfc_array_i16 * const restrict b, int try_blas,
 	int blas_limit, blas_call gemm)
 {
+  aux_matmul_i16 (retarray, a, b, try_blas, blas_limit, gemm);
+}
+
+static void
+aux_matmul_i16 (gfc_array_i16 * const restrict retarray, 
+	gfc_array_i16 * const restrict a, gfc_array_i16 * const restrict b, int try_blas,
+	int blas_limit, blas_call gemm)
+#else
+matmul_i16 (gfc_array_i16 * const restrict retarray, 
+	gfc_array_i16 * const restrict a, gfc_array_i16 * const restrict b, int try_blas,
+	int blas_limit, blas_call gemm)
+#endif
+{
   const GFC_INTEGER_16 * restrict abase;
   const GFC_INTEGER_16 * restrict bbase;
   GFC_INTEGER_16 * restrict dest;
Index: generated/matmul_i2.c
===================================================================
--- generated/matmul_i2.c	(Revision 242477)
+++ generated/matmul_i2.c	(Arbeitskopie)
@@ -75,11 +75,37 @@  extern void matmul_i2 (gfc_array_i2 * const restri
 	int blas_limit, blas_call gemm);
 export_proto(matmul_i2);
 
+#ifdef __x86_64__
+
+/* For x86_64, we switch to AVX if that is available.  For this, we
+   let the actual work be done by the static aux_matmul - function.
+   The user-callable function will then automagically contain the
+   selection code for the right architecture.  This is done to avoid
+   knowledge of architecture details in the front end.  */
+
+static void aux_matmul_i2 (gfc_array_i2 * const restrict retarray, 
+	gfc_array_i2 * const restrict a, gfc_array_i2 * const restrict b, int try_blas,
+	int blas_limit, blas_call gemm)
+	__attribute__ ((target_clones("avx,default")));
+
 void
 matmul_i2 (gfc_array_i2 * const restrict retarray, 
 	gfc_array_i2 * const restrict a, gfc_array_i2 * const restrict b, int try_blas,
 	int blas_limit, blas_call gemm)
 {
+  aux_matmul_i2 (retarray, a, b, try_blas, blas_limit, gemm);
+}
+
+static void
+aux_matmul_i2 (gfc_array_i2 * const restrict retarray, 
+	gfc_array_i2 * const restrict a, gfc_array_i2 * const restrict b, int try_blas,
+	int blas_limit, blas_call gemm)
+#else
+matmul_i2 (gfc_array_i2 * const restrict retarray, 
+	gfc_array_i2 * const restrict a, gfc_array_i2 * const restrict b, int try_blas,
+	int blas_limit, blas_call gemm)
+#endif
+{
   const GFC_INTEGER_2 * restrict abase;
   const GFC_INTEGER_2 * restrict bbase;
   GFC_INTEGER_2 * restrict dest;
Index: generated/matmul_i4.c
===================================================================
--- generated/matmul_i4.c	(Revision 242477)
+++ generated/matmul_i4.c	(Arbeitskopie)
@@ -75,11 +75,37 @@  extern void matmul_i4 (gfc_array_i4 * const restri
 	int blas_limit, blas_call gemm);
 export_proto(matmul_i4);
 
+#ifdef __x86_64__
+
+/* For x86_64, we switch to AVX if that is available.  For this, we
+   let the actual work be done by the static aux_matmul - function.
+   The user-callable function will then automagically contain the
+   selection code for the right architecture.  This is done to avoid
+   knowledge of architecture details in the front end.  */
+
+static void aux_matmul_i4 (gfc_array_i4 * const restrict retarray, 
+	gfc_array_i4 * const restrict a, gfc_array_i4 * const restrict b, int try_blas,
+	int blas_limit, blas_call gemm)
+	__attribute__ ((target_clones("avx,default")));
+
 void
 matmul_i4 (gfc_array_i4 * const restrict retarray, 
 	gfc_array_i4 * const restrict a, gfc_array_i4 * const restrict b, int try_blas,
 	int blas_limit, blas_call gemm)
 {
+  aux_matmul_i4 (retarray, a, b, try_blas, blas_limit, gemm);
+}
+
+static void
+aux_matmul_i4 (gfc_array_i4 * const restrict retarray, 
+	gfc_array_i4 * const restrict a, gfc_array_i4 * const restrict b, int try_blas,
+	int blas_limit, blas_call gemm)
+#else
+matmul_i4 (gfc_array_i4 * const restrict retarray, 
+	gfc_array_i4 * const restrict a, gfc_array_i4 * const restrict b, int try_blas,
+	int blas_limit, blas_call gemm)
+#endif
+{
   const GFC_INTEGER_4 * restrict abase;
   const GFC_INTEGER_4 * restrict bbase;
   GFC_INTEGER_4 * restrict dest;
Index: generated/matmul_i8.c
===================================================================
--- generated/matmul_i8.c	(Revision 242477)
+++ generated/matmul_i8.c	(Arbeitskopie)
@@ -75,11 +75,37 @@  extern void matmul_i8 (gfc_array_i8 * const restri
 	int blas_limit, blas_call gemm);
 export_proto(matmul_i8);
 
+#ifdef __x86_64__
+
+/* For x86_64, we switch to AVX if that is available.  For this, we
+   let the actual work be done by the static aux_matmul - function.
+   The user-callable function will then automagically contain the
+   selection code for the right architecture.  This is done to avoid
+   knowledge of architecture details in the front end.  */
+
+static void aux_matmul_i8 (gfc_array_i8 * const restrict retarray, 
+	gfc_array_i8 * const restrict a, gfc_array_i8 * const restrict b, int try_blas,
+	int blas_limit, blas_call gemm)
+	__attribute__ ((target_clones("avx,default")));
+
 void
 matmul_i8 (gfc_array_i8 * const restrict retarray, 
 	gfc_array_i8 * const restrict a, gfc_array_i8 * const restrict b, int try_blas,
 	int blas_limit, blas_call gemm)
 {
+  aux_matmul_i8 (retarray, a, b, try_blas, blas_limit, gemm);
+}
+
+static void
+aux_matmul_i8 (gfc_array_i8 * const restrict retarray, 
+	gfc_array_i8 * const restrict a, gfc_array_i8 * const restrict b, int try_blas,
+	int blas_limit, blas_call gemm)
+#else
+matmul_i8 (gfc_array_i8 * const restrict retarray, 
+	gfc_array_i8 * const restrict a, gfc_array_i8 * const restrict b, int try_blas,
+	int blas_limit, blas_call gemm)
+#endif
+{
   const GFC_INTEGER_8 * restrict abase;
   const GFC_INTEGER_8 * restrict bbase;
   GFC_INTEGER_8 * restrict dest;
Index: generated/matmul_r10.c
===================================================================
--- generated/matmul_r10.c	(Revision 242477)
+++ generated/matmul_r10.c	(Arbeitskopie)
@@ -75,11 +75,37 @@  extern void matmul_r10 (gfc_array_r10 * const rest
 	int blas_limit, blas_call gemm);
 export_proto(matmul_r10);
 
+#ifdef __x86_64__
+
+/* For x86_64, we switch to AVX if that is available.  For this, we
+   let the actual work be done by the static aux_matmul - function.
+   The user-callable function will then automagically contain the
+   selection code for the right architecture.  This is done to avoid
+   knowledge of architecture details in the front end.  */
+
+static void aux_matmul_r10 (gfc_array_r10 * const restrict retarray, 
+	gfc_array_r10 * const restrict a, gfc_array_r10 * const restrict b, int try_blas,
+	int blas_limit, blas_call gemm)
+	__attribute__ ((target_clones("avx,default")));
+
 void
 matmul_r10 (gfc_array_r10 * const restrict retarray, 
 	gfc_array_r10 * const restrict a, gfc_array_r10 * const restrict b, int try_blas,
 	int blas_limit, blas_call gemm)
 {
+  aux_matmul_r10 (retarray, a, b, try_blas, blas_limit, gemm);
+}
+
+static void
+aux_matmul_r10 (gfc_array_r10 * const restrict retarray, 
+	gfc_array_r10 * const restrict a, gfc_array_r10 * const restrict b, int try_blas,
+	int blas_limit, blas_call gemm)
+#else
+matmul_r10 (gfc_array_r10 * const restrict retarray, 
+	gfc_array_r10 * const restrict a, gfc_array_r10 * const restrict b, int try_blas,
+	int blas_limit, blas_call gemm)
+#endif
+{
   const GFC_REAL_10 * restrict abase;
   const GFC_REAL_10 * restrict bbase;
   GFC_REAL_10 * restrict dest;
Index: generated/matmul_r16.c
===================================================================
--- generated/matmul_r16.c	(Revision 242477)
+++ generated/matmul_r16.c	(Arbeitskopie)
@@ -75,11 +75,37 @@  extern void matmul_r16 (gfc_array_r16 * const rest
 	int blas_limit, blas_call gemm);
 export_proto(matmul_r16);
 
+#ifdef __x86_64__
+
+/* For x86_64, we switch to AVX if that is available.  For this, we
+   let the actual work be done by the static aux_matmul - function.
+   The user-callable function will then automagically contain the
+   selection code for the right architecture.  This is done to avoid
+   knowledge of architecture details in the front end.  */
+
+static void aux_matmul_r16 (gfc_array_r16 * const restrict retarray, 
+	gfc_array_r16 * const restrict a, gfc_array_r16 * const restrict b, int try_blas,
+	int blas_limit, blas_call gemm)
+	__attribute__ ((target_clones("avx,default")));
+
 void
 matmul_r16 (gfc_array_r16 * const restrict retarray, 
 	gfc_array_r16 * const restrict a, gfc_array_r16 * const restrict b, int try_blas,
 	int blas_limit, blas_call gemm)
 {
+  aux_matmul_r16 (retarray, a, b, try_blas, blas_limit, gemm);
+}
+
+static void
+aux_matmul_r16 (gfc_array_r16 * const restrict retarray, 
+	gfc_array_r16 * const restrict a, gfc_array_r16 * const restrict b, int try_blas,
+	int blas_limit, blas_call gemm)
+#else
+matmul_r16 (gfc_array_r16 * const restrict retarray, 
+	gfc_array_r16 * const restrict a, gfc_array_r16 * const restrict b, int try_blas,
+	int blas_limit, blas_call gemm)
+#endif
+{
   const GFC_REAL_16 * restrict abase;
   const GFC_REAL_16 * restrict bbase;
   GFC_REAL_16 * restrict dest;
Index: generated/matmul_r4.c
===================================================================
--- generated/matmul_r4.c	(Revision 242477)
+++ generated/matmul_r4.c	(Arbeitskopie)
@@ -75,11 +75,37 @@  extern void matmul_r4 (gfc_array_r4 * const restri
 	int blas_limit, blas_call gemm);
 export_proto(matmul_r4);
 
+#ifdef __x86_64__
+
+/* For x86_64, we switch to AVX if that is available.  For this, we
+   let the actual work be done by the static aux_matmul - function.
+   The user-callable function will then automagically contain the
+   selection code for the right architecture.  This is done to avoid
+   knowledge of architecture details in the front end.  */
+
+static void aux_matmul_r4 (gfc_array_r4 * const restrict retarray, 
+	gfc_array_r4 * const restrict a, gfc_array_r4 * const restrict b, int try_blas,
+	int blas_limit, blas_call gemm)
+	__attribute__ ((target_clones("avx,default")));
+
 void
 matmul_r4 (gfc_array_r4 * const restrict retarray, 
 	gfc_array_r4 * const restrict a, gfc_array_r4 * const restrict b, int try_blas,
 	int blas_limit, blas_call gemm)
 {
+  aux_matmul_r4 (retarray, a, b, try_blas, blas_limit, gemm);
+}
+
+static void
+aux_matmul_r4 (gfc_array_r4 * const restrict retarray, 
+	gfc_array_r4 * const restrict a, gfc_array_r4 * const restrict b, int try_blas,
+	int blas_limit, blas_call gemm)
+#else
+matmul_r4 (gfc_array_r4 * const restrict retarray, 
+	gfc_array_r4 * const restrict a, gfc_array_r4 * const restrict b, int try_blas,
+	int blas_limit, blas_call gemm)
+#endif
+{
   const GFC_REAL_4 * restrict abase;
   const GFC_REAL_4 * restrict bbase;
   GFC_REAL_4 * restrict dest;
Index: generated/matmul_r8.c
===================================================================
--- generated/matmul_r8.c	(Revision 242477)
+++ generated/matmul_r8.c	(Arbeitskopie)
@@ -75,11 +75,37 @@  extern void matmul_r8 (gfc_array_r8 * const restri
 	int blas_limit, blas_call gemm);
 export_proto(matmul_r8);
 
+#ifdef __x86_64__
+
+/* For x86_64, we switch to AVX if that is available.  For this, we
+   let the actual work be done by the static aux_matmul - function.
+   The user-callable function will then automagically contain the
+   selection code for the right architecture.  This is done to avoid
+   knowledge of architecture details in the front end.  */
+
+static void aux_matmul_r8 (gfc_array_r8 * const restrict retarray, 
+	gfc_array_r8 * const restrict a, gfc_array_r8 * const restrict b, int try_blas,
+	int blas_limit, blas_call gemm)
+	__attribute__ ((target_clones("avx,default")));
+
 void
 matmul_r8 (gfc_array_r8 * const restrict retarray, 
 	gfc_array_r8 * const restrict a, gfc_array_r8 * const restrict b, int try_blas,
 	int blas_limit, blas_call gemm)
 {
+  aux_matmul_r8 (retarray, a, b, try_blas, blas_limit, gemm);
+}
+
+static void
+aux_matmul_r8 (gfc_array_r8 * const restrict retarray, 
+	gfc_array_r8 * const restrict a, gfc_array_r8 * const restrict b, int try_blas,
+	int blas_limit, blas_call gemm)
+#else
+matmul_r8 (gfc_array_r8 * const restrict retarray, 
+	gfc_array_r8 * const restrict a, gfc_array_r8 * const restrict b, int try_blas,
+	int blas_limit, blas_call gemm)
+#endif
+{
   const GFC_REAL_8 * restrict abase;
   const GFC_REAL_8 * restrict bbase;
   GFC_REAL_8 * restrict dest;
Index: m4/matmul.m4
===================================================================
--- m4/matmul.m4	(Revision 242477)
+++ m4/matmul.m4	(Arbeitskopie)
@@ -76,11 +76,37 @@  extern void matmul_'rtype_code` ('rtype` * const r
 	int blas_limit, blas_call gemm);
 export_proto(matmul_'rtype_code`);
 
+#ifdef __x86_64__
+
+/* For x86_64, we switch to AVX if that is available.  For this, we
+   let the actual work be done by the static aux_matmul - function.
+   The user-callable function will then automagically contain the
+   selection code for the right architecture.  This is done to avoid
+   knowledge of architecture details in the front end.  */
+
+static void aux_matmul_'rtype_code` ('rtype` * const restrict retarray, 
+	'rtype` * const restrict a, 'rtype` * const restrict b, int try_blas,
+	int blas_limit, blas_call gemm)
+	__attribute__ ((target_clones("avx,default")));
+
 void
 matmul_'rtype_code` ('rtype` * const restrict retarray, 
 	'rtype` * const restrict a, 'rtype` * const restrict b, int try_blas,
 	int blas_limit, blas_call gemm)
 {
+  aux_matmul_'rtype_code` (retarray, a, b, try_blas, blas_limit, gemm);
+}
+
+static void
+aux_matmul_'rtype_code` ('rtype` * const restrict retarray, 
+	'rtype` * const restrict a, 'rtype` * const restrict b, int try_blas,
+	int blas_limit, blas_call gemm)
+#else
+matmul_'rtype_code` ('rtype` * const restrict retarray, 
+	'rtype` * const restrict a, 'rtype` * const restrict b, int try_blas,
+	int blas_limit, blas_call gemm)
+#endif
+{
   const 'rtype_name` * restrict abase;
   const 'rtype_name` * restrict bbase;
   'rtype_name` * restrict dest;