
[libgfortran] PR51119 - MATMUL slow for large matrices

Message ID 63f1ffa4-02b5-11c9-b64d-86336733d4b5@charter.net
State New

Commit Message

Jerry DeLisle Nov. 14, 2016, 10:13 p.m. UTC
On 11/13/2016 11:03 PM, Thomas Koenig wrote:
> Hi Jerry,
>
> I think this
>
> +      /* Parameter adjustments */
> +      c_dim1 = m;
> +      c_offset = 1 + c_dim1;
>
> should be
>
> +      /* Parameter adjustments */
> +      c_dim1 = rystride;
> +      c_offset = 1 + c_dim1;
>
> Regarding options for matmul:  It is possible to add the
> options to the lines in Makefile.in
>
> # Turn on vectorization and loop unrolling for matmul.
> $(patsubst %.c,%.lo,$(notdir $(i_matmul_c))): AM_CFLAGS += -ftree-vectorize
> -funroll-loops
>
> This is a great step forward.  I think we can close most matmul-related
> PRs once this patch has been applied.
>
> Regards
>
>     Thomas
>

With Thomas's suggestion, I can remove the #pragma optimize from the source code. 
Doing this:

diff --git a/libgfortran/Makefile.am b/libgfortran/Makefile.am
index 39d3e11..9ee17f9 100644
--- a/libgfortran/Makefile.am
+++ b/libgfortran/Makefile.am
@@ -850,7 +850,7 @@ intrinsics/dprod_r8.f90 \
 intrinsics/f2c_specifics.F90

 # Turn on vectorization and loop unrolling for matmul.
-$(patsubst %.c,%.lo,$(notdir $(i_matmul_c))): AM_CFLAGS += -ftree-vectorize -funroll-loops
+$(patsubst %.c,%.lo,$(notdir $(i_matmul_c))): AM_CFLAGS += -ffast-math -fno-protect-parens -fstack-arrays -ftree-vectorize -funroll-loops --param max-unroll-times=4 -ftree-loop-vectorize

 # Logical matmul doesn't vectorize.
 $(patsubst %.c,%.lo,$(notdir $(i_matmull_c))): AM_CFLAGS += -funroll-loops

Comparing gfortran 6 vs 7: (test program posted in PR51119)

$ gfc6 -static -Ofast -finline-matmul-limit=32 -funroll-loops --param max-unroll-times=4 compare.f90
$ ./a.out
  =========================================================
  ================            MEASURED GIGAFLOPS          =
  =========================================================
                  Matmul                           Matmul
                  fixed                 Matmul     variable
  Size  Loops     explicit   refMatmul  assumed    explicit
  =========================================================
     2  2000     11.928      0.047      0.082      0.138
     4  2000      1.455      0.220      0.371      0.316
     8  2000      1.476      0.737      0.704      1.574
    16  2000      4.536      3.755      2.825      3.820
    32  2000      6.070      5.443      3.124      5.158
    64  2000      5.423      5.355      5.405      5.413
   128  2000      5.913      5.841      5.917      5.917
   256   477      5.865      5.252      5.863      5.862
   512    59      2.794      2.841      2.794      2.791
  1024     7      1.662      1.356      1.662      1.661
  2048     1      1.753      1.724      1.753      1.754

$ gfc -static -Ofast -finline-matmul-limit=32 -funroll-loops --param max-unroll-times=4 compare.f90
$ ./a.out
  =========================================================
  ================            MEASURED GIGAFLOPS          =
  =========================================================
                  Matmul                           Matmul
                  fixed                 Matmul     variable
  Size  Loops     explicit   refMatmul  assumed    explicit
  =========================================================
     2  2000     12.146      0.042      0.090      0.146
     4  2000      1.496      0.232      0.384      0.325
     8  2000      2.330      0.765      0.763      0.965
    16  2000      4.611      4.120      2.792      3.830
    32  2000      6.068      5.265      3.102      4.859
    64  2000      6.527      5.329      6.425      6.495
   128  2000      8.207      5.643      8.336      8.441
   256   477      9.210      4.967      9.367      9.299
   512    59      8.330      2.772      8.422      8.342
  1024     7      8.430      1.378      8.511      8.424
  2048     1      8.339      1.718      8.425      8.322

I do think we need to adjust the default inline limit and should do this 
separately from this patch.

With these changes, OK for trunk?

Regards,

Jerry

Comments

Thomas Koenig Nov. 15, 2016, 7:22 a.m. UTC | #1
Hi Jerry,

> With these changes, OK for trunk?

Just going over this with a fine comb...

One thing just struck me:   The loop variables should be index_type, so

       const index_type m = xcount, n = ycount, k = count;

[...]

    index_type a_dim1, a_offset, b_dim1, b_offset, c_dim1, c_offset, i1, i2,
	      i3, i4, i5, i6;

       /* Local variables */
       GFC_REAL_4 t1[65536], /* was [256][256] */
		 f11, f12, f21, f22, f31, f32, f41, f42,
		 f13, f14, f23, f24, f33, f34, f43, f44;
       index_type i, j, l, ii, jj, ll;
       index_type isec, jsec, lsec, uisec, ujsec, ulsec;

I agree that we should do the tuning of the inline limit
separately.

When we do that, we should think about -Os.  With the buffering, we have
much more memory usage in the library function.  If -Os is in force,
we should also consider raising the limit for inlining.

Since I was involved in the development, I would like to give others a
few days to raise more comments.  If there are none, OK to commit with
the above change within a few days. Of course, somebody else might also
OK this patch :-)

Regards

	Thomas
Richard Biener Nov. 15, 2016, 8:21 a.m. UTC | #2
On Mon, Nov 14, 2016 at 11:13 PM, Jerry DeLisle <jvdelisle@charter.net> wrote:
> On 11/13/2016 11:03 PM, Thomas Koenig wrote:
>>
>> Hi Jerry,
>>
>> I think this
>>
>> +      /* Parameter adjustments */
>> +      c_dim1 = m;
>> +      c_offset = 1 + c_dim1;
>>
>> should be
>>
>> +      /* Parameter adjustments */
>> +      c_dim1 = rystride;
>> +      c_offset = 1 + c_dim1;
>>
>> Regarding options for matmul:  It is possible to add the
>> options to the lines in Makefile.in
>>
>> # Turn on vectorization and loop unrolling for matmul.
>> $(patsubst %.c,%.lo,$(notdir $(i_matmul_c))): AM_CFLAGS +=
>> -ftree-vectorize
>> -funroll-loops
>>
>> This is a great step forward.  I think we can close most matmul-related
>> PRs once this patch has been applied.
>>
>> Regards
>>
>>     Thomas
>>
>
> With Thomas's suggestion, I can remove the #pragma optimize from the source
> code. Doing this: (long lines wrapped as shown)
>
> diff --git a/libgfortran/Makefile.am b/libgfortran/Makefile.am
> index 39d3e11..9ee17f9 100644
> --- a/libgfortran/Makefile.am
> +++ b/libgfortran/Makefile.am
> @@ -850,7 +850,7 @@ intrinsics/dprod_r8.f90 \
>  intrinsics/f2c_specifics.F90
>
>  # Turn on vectorization and loop unrolling for matmul.
> -$(patsubst %.c,%.lo,$(notdir $(i_matmul_c))): AM_CFLAGS += -ftree-vectorize
> -funroll-loops
> +$(patsubst %.c,%.lo,$(notdir $(i_matmul_c))): AM_CFLAGS += -ffast-math
> -fno-protect-parens -fstack-arrays -ftree-vectorize -funroll-loops --param
> max-unroll-times=4 -ftree-loop-vectorize

-ftree-vectorize turns on -ftree-loop-vectorize and
-ftree-slp-vectorize already.

>  # Logical matmul doesn't vectorize.
>  $(patsubst %.c,%.lo,$(notdir $(i_matmull_c))): AM_CFLAGS += -funroll-loops
>
>
> Comparing gfortran 6 vs 7: (test program posted in PR51119)
>
> $ gfc6 -static -Ofast -finline-matmul-limit=32 -funroll-loops --param
> max-unroll-times=4 compare.f90
> $ ./a.out
>  =========================================================
>  ================            MEASURED GIGAFLOPS          =
>  =========================================================
>                  Matmul                           Matmul
>                  fixed                 Matmul     variable
>  Size  Loops     explicit   refMatmul  assumed    explicit
>  =========================================================
>     2  2000     11.928      0.047      0.082      0.138
>     4  2000      1.455      0.220      0.371      0.316
>     8  2000      1.476      0.737      0.704      1.574
>    16  2000      4.536      3.755      2.825      3.820
>    32  2000      6.070      5.443      3.124      5.158
>    64  2000      5.423      5.355      5.405      5.413
>   128  2000      5.913      5.841      5.917      5.917
>   256   477      5.865      5.252      5.863      5.862
>   512    59      2.794      2.841      2.794      2.791
>  1024     7      1.662      1.356      1.662      1.661
>  2048     1      1.753      1.724      1.753      1.754
>
> $ gfc -static -Ofast -finline-matmul-limit=32 -funroll-loops --param
> max-unroll-times=4 compare.f90
> $ ./a.out
>  =========================================================
>  ================            MEASURED GIGAFLOPS          =
>  =========================================================
>                  Matmul                           Matmul
>                  fixed                 Matmul     variable
>  Size  Loops     explicit   refMatmul  assumed    explicit
>  =========================================================
>     2  2000     12.146      0.042      0.090      0.146
>     4  2000      1.496      0.232      0.384      0.325
>     8  2000      2.330      0.765      0.763      0.965
>    16  2000      4.611      4.120      2.792      3.830
>    32  2000      6.068      5.265      3.102      4.859
>    64  2000      6.527      5.329      6.425      6.495
>   128  2000      8.207      5.643      8.336      8.441
>   256   477      9.210      4.967      9.367      9.299
>   512    59      8.330      2.772      8.422      8.342
>  1024     7      8.430      1.378      8.511      8.424
>  2048     1      8.339      1.718      8.425      8.322
>
> I do think we need to adjust the default inline limit and should do this
> separately from this patch.
>
> With these changes, OK for trunk?
>
> Regards,
>
> Jerry
>
Jerry DeLisle Nov. 15, 2016, 3:59 p.m. UTC | #3
On 11/14/2016 11:22 PM, Thomas Koenig wrote:
> Hi Jerry,
>
>> With these changes, OK for trunk?
>
> Just going over this with a fine comb...
>
> One thing just struck me:   The loop variables should be index_type, so
>
>       const index_type m = xcount, n = ycount, k = count;
>
> [...]
>
>    index_type a_dim1, a_offset, b_dim1, b_offset, c_dim1, c_offset, i1, i2,
>           i3, i4, i5, i6;
>
>       /* Local variables */
>       GFC_REAL_4 t1[65536], /* was [256][256] */
>          f11, f12, f21, f22, f31, f32, f41, f42,
>          f13, f14, f23, f24, f33, f34, f43, f44;
>       index_type i, j, l, ii, jj, ll;
>       index_type isec, jsec, lsec, uisec, ujsec, ulsec;
>
> I agree that we should do the tuning of the inline limit
> separately.
>

Several of my iterations used index_type. I found that using int gives better 
performance. The reason is that index_type is ptrdiff_t, a 64-bit integer. I 
suspect we eliminate one memory fetch for each of these and reduce register 
pressure, since two 32-bit indices can occupy the space one 64-bit value needs. 
I will change back and retest.

and Richard commented "-ftree-vectorize turns on -ftree-loop-vectorize and
-ftree-slp-vectorize already."

I will remove those two options and keep -ftree-vectorize.

I will report back my findings.

Thanks, and a fine tooth comb is a very good thing.

Jerry
Jerry DeLisle Nov. 15, 2016, 4:37 p.m. UTC | #4
On 11/15/2016 07:59 AM, Jerry DeLisle wrote:
> On 11/14/2016 11:22 PM, Thomas Koenig wrote:
>> Hi Jerry,
>>
>>> With these changes, OK for trunk?
>>
>> Just going over this with a fine comb...
>>
>> One thing just struck me:   The loop variables should be index_type, so
>>
>>       const index_type m = xcount, n = ycount, k = count;
>>
>> [...]
>>
>>    index_type a_dim1, a_offset, b_dim1, b_offset, c_dim1, c_offset, i1, i2,
>>           i3, i4, i5, i6;
>>
>>       /* Local variables */
>>       GFC_REAL_4 t1[65536], /* was [256][256] */
>>          f11, f12, f21, f22, f31, f32, f41, f42,
>>          f13, f14, f23, f24, f33, f34, f43, f44;
>>       index_type i, j, l, ii, jj, ll;
>>       index_type isec, jsec, lsec, uisec, ujsec, ulsec;
>>
>> I agree that we should do the tuning of the inline limit
>> separately.
>>
>
> Several of my iterations used index_type. I found using integer gives better
> performance. The reason is that they are of type ptr_diff_t which is a 64 bit
> integer. I suspect we eliminate one memory fetch for each of these and reduce
> the register loading by reducing the number of registers needed, two for one
> situation. I will change back and retest.
>
> and Paul commeneted "-ftree-vectorize turns on -ftree-loop-vectorize and
> -ftree-slp-vectorize already."
>
> I will remove those to options and keep -ftree-vectorize
>
> I will report back my findings.
>

Changed back to index_type; all OK. The earlier slowdown must have been some OS 
activity in the background.

All comments incorporated. Standing by for approval.

Jerry
Janne Blomqvist Nov. 15, 2016, 7:52 p.m. UTC | #5
On Tue, Nov 15, 2016 at 6:37 PM, Jerry DeLisle <jvdelisle@charter.net> wrote:
> All comments incorporated. Standing by for approval.

Looks good, nice job! Ok for trunk.

I was thinking that for strided arrays, it probably is faster to copy
them to dense arrays before doing the matrix multiplication. That
would also enable using an optimized blas (-fexternal-blas) for
strided arrays. But this is of course nothing that blocks this patch,
just something that might be worth looking into in the future.
Jerry DeLisle Nov. 16, 2016, 9:58 p.m. UTC | #6
Committed after approval on bugzilla to eliminate warnings.

2016-11-16  Jerry DeLisle  <jvdelisle@gcc.gnu.org>

	PR libgfortran/51119
	* Makefile.am: Remove -fno-protect-parens -fstack-arrays.
	* Makefile.in: Regenerate.


r242517 = 026291bdda18395d7c746856dd7e4ed384856a1b (refs/remotes/svn/trunk)
	M	libgfortran/Makefile.in
	M	libgfortran/ChangeLog
	M	libgfortran/Makefile.am

Regards,

Jerry

Patch

diff --git a/libgfortran/Makefile.am b/libgfortran/Makefile.am
index 39d3e11..9ee17f9 100644
--- a/libgfortran/Makefile.am
+++ b/libgfortran/Makefile.am
@@ -850,7 +850,7 @@  intrinsics/dprod_r8.f90 \
  intrinsics/f2c_specifics.F90

  # Turn on vectorization and loop unrolling for matmul.
-$(patsubst %.c,%.lo,$(notdir $(i_matmul_c))): AM_CFLAGS += -ftree-vectorize -funroll-loops
+$(patsubst %.c,%.lo,$(notdir $(i_matmul_c))): AM_CFLAGS += -ffast-math -fno-protect-parens -fstack-arrays -ftree-vectorize -funroll-loops --param max-unroll-times=4 -ftree-loop-vectorize
  # Logical matmul doesn't vectorize.
  $(patsubst %.c,%.lo,$(notdir $(i_matmull_c))): AM_CFLAGS += -funroll-loops