Message ID | 63f1ffa4-02b5-11c9-b64d-86336733d4b5@charter.net |
---|---|
State | New |
Headers | show |
Hi Jerry,
> With these changes, OK for trunk?
Just going over this with a fine comb...
One thing just struck me: The loop variables should be index_type, so
const index_type m = xcount, n = ycount, k = count;
[...]
index_type a_dim1, a_offset, b_dim1, b_offset, c_dim1, c_offset, i1, i2,
i3, i4, i5, i6;
/* Local variables */
GFC_REAL_4 t1[65536], /* was [256][256] */
f11, f12, f21, f22, f31, f32, f41, f42,
f13, f14, f23, f24, f33, f34, f43, f44;
index_type i, j, l, ii, jj, ll;
index_type isec, jsec, lsec, uisec, ujsec, ulsec;
I agree that we should do the tuning of the inline limit
separately.
When we do that, we should think about -Os. With the buffering, we have
much more memory usage in the library function. If -Os is in force,
we should also consider raising the limit for inlining.
Since I was involved in the development, I would like to give others a
few days to raise more comments. If there are none, OK to commit with
the above change within a few days. Of course, somebody else might also
OK this patch :-)
Regards
Thomas
On Mon, Nov 14, 2016 at 11:13 PM, Jerry DeLisle <jvdelisle@charter.net> wrote: > On 11/13/2016 11:03 PM, Thomas Koenig wrote: >> >> Hi Jerry, >> >> I think this >> >> + /* Parameter adjustments */ >> + c_dim1 = m; >> + c_offset = 1 + c_dim1; >> >> should be >> >> + /* Parameter adjustments */ >> + c_dim1 = rystride; >> + c_offset = 1 + c_dim1; >> >> Regarding options for matmul: It is possible to add the >> options to the lines in Makefile.in >> >> # Turn on vectorization and loop unrolling for matmul. >> $(patsubst %.c,%.lo,$(notdir $(i_matmul_c))): AM_CFLAGS += >> -ftree-vectorize >> -funroll-loops >> >> This is a great step forward. I think we can close most matmul-related >> PRs once this patch has been applied. >> >> Regards >> >> Thomas >> > > With Thomas suggestion, I can remove the #pragma optimize from the source > code. Doing this: (long lines wrapped as shown) > > diff --git a/libgfortran/Makefile.am b/libgfortran/Makefile.am > index 39d3e11..9ee17f9 100644 > --- a/libgfortran/Makefile.am > +++ b/libgfortran/Makefile.am > @@ -850,7 +850,7 @@ intrinsics/dprod_r8.f90 \ > intrinsics/f2c_specifics.F90 > > # Turn on vectorization and loop unrolling for matmul. > -$(patsubst %.c,%.lo,$(notdir $(i_matmul_c))): AM_CFLAGS += -ftree-vectorize > -funroll-loops > +$(patsubst %.c,%.lo,$(notdir $(i_matmul_c))): AM_CFLAGS += -ffast-math > -fno-protect-parens -fstack-arrays -ftree-vectorize -funroll-loops --param > max-unroll-times=4 -ftree-loop-vectorize -ftree-vectorize turns on -ftree-loop-vectorize and -ftree-slp-vectorize already. > # Logical matmul doesn't vectorize. > $(patsubst %.c,%.lo,$(notdir $(i_matmull_c))): AM_CFLAGS += -funroll-loops > > > Comparing gfortran 6 vs 7: (test program posted in PR51119) > > $ gfc6 -static -Ofast -finline-matmul-limit=32 -funroll-loops --param > max-unroll-times=4 compare.f90 > $ ./a.out > ========================================================= > ================ MEASURED GIGAFLOPS = > ========================================================= > Matmul Matmul > fixed Matmul variable > Size Loops explicit refMatmul assumed explicit > ========================================================= > 2 2000 11.928 0.047 0.082 0.138 > 4 2000 1.455 0.220 0.371 0.316 > 8 2000 1.476 0.737 0.704 1.574 > 16 2000 4.536 3.755 2.825 3.820 > 32 2000 6.070 5.443 3.124 5.158 > 64 2000 5.423 5.355 5.405 5.413 > 128 2000 5.913 5.841 5.917 5.917 > 256 477 5.865 5.252 5.863 5.862 > 512 59 2.794 2.841 2.794 2.791 > 1024 7 1.662 1.356 1.662 1.661 > 2048 1 1.753 1.724 1.753 1.754 > > $ gfc -static -Ofast -finline-matmul-limit=32 -funroll-loops --param > max-unroll-times=4 compare.f90 > $ ./a.out > ========================================================= > ================ MEASURED GIGAFLOPS = > ========================================================= > Matmul Matmul > fixed Matmul variable > Size Loops explicit refMatmul assumed explicit > ========================================================= > 2 2000 12.146 0.042 0.090 0.146 > 4 2000 1.496 0.232 0.384 0.325 > 8 2000 2.330 0.765 0.763 0.965 > 16 2000 4.611 4.120 2.792 3.830 > 32 2000 6.068 5.265 3.102 4.859 > 64 2000 6.527 5.329 6.425 6.495 > 128 2000 8.207 5.643 8.336 8.441 > 256 477 9.210 4.967 9.367 9.299 > 512 59 8.330 2.772 8.422 8.342 > 1024 7 8.430 1.378 8.511 8.424 > 2048 1 8.339 1.718 8.425 8.322 > > I do think we need to adjust the default inline limit and should do this > separately from this patch. > > With these changes, OK for trunk? > > Regards, > > Jerry >
On 11/14/2016 11:22 PM, Thomas Koenig wrote: > Hi Jerry, > >> With these changes, OK for trunk? > > Just going over this with a fine comb... > > One thing just struck me: The loop variables should be index_type, so > > const index_type m = xcount, n = ycount, k = count; > > [...] > > index_type a_dim1, a_offset, b_dim1, b_offset, c_dim1, c_offset, i1, i2, > i3, i4, i5, i6; > > /* Local variables */ > GFC_REAL_4 t1[65536], /* was [256][256] */ > f11, f12, f21, f22, f31, f32, f41, f42, > f13, f14, f23, f24, f33, f34, f43, f44; > index_type i, j, l, ii, jj, ll; > index_type isec, jsec, lsec, uisec, ujsec, ulsec; > > I agree that we should do the tuning of the inline limit > separately. > Several of my iterations used index_type. I found using integer gives better performance. The reason is that they are of type ptr_diff_t which is a 64 bit integer. I suspect we eliminate one memory fetch for each of these and reduce the register loading by reducing the number of registers needed, two for one situation. I will change back and retest. and Paul commeneted "-ftree-vectorize turns on -ftree-loop-vectorize and -ftree-slp-vectorize already." I will remove those to options and keep -ftree-vectorize I will report back my findings. Thanks, and a fine tooth comb is a very good thing. Jerry
On 11/15/2016 07:59 AM, Jerry DeLisle wrote: > On 11/14/2016 11:22 PM, Thomas Koenig wrote: >> Hi Jerry, >> >>> With these changes, OK for trunk? >> >> Just going over this with a fine comb... >> >> One thing just struck me: The loop variables should be index_type, so >> >> const index_type m = xcount, n = ycount, k = count; >> >> [...] >> >> index_type a_dim1, a_offset, b_dim1, b_offset, c_dim1, c_offset, i1, i2, >> i3, i4, i5, i6; >> >> /* Local variables */ >> GFC_REAL_4 t1[65536], /* was [256][256] */ >> f11, f12, f21, f22, f31, f32, f41, f42, >> f13, f14, f23, f24, f33, f34, f43, f44; >> index_type i, j, l, ii, jj, ll; >> index_type isec, jsec, lsec, uisec, ujsec, ulsec; >> >> I agree that we should do the tuning of the inline limit >> separately. >> > > Several of my iterations used index_type. I found using integer gives better > performance. The reason is that they are of type ptr_diff_t which is a 64 bit > integer. I suspect we eliminate one memory fetch for each of these and reduce > the register loading by reducing the number of registers needed, two for one > situation. I will change back and retest. > > and Paul commeneted "-ftree-vectorize turns on -ftree-loop-vectorize and > -ftree-slp-vectorize already." > > I will remove those to options and keep -ftree-vectorize > > I will report back my findings. > Changed back to index_type, all OK, must have been some OS stuff running in the background. All comments incorporated. Standing by for approval. Jerry
On Tue, Nov 15, 2016 at 6:37 PM, Jerry DeLisle <jvdelisle@charter.net> wrote:
> All comments incorporated. Standing by for approval.
Looks good, nice job! Ok for trunk.
I was thinking that for strided arrays, it probably is faster to copy
them to dense arrays before doing the matrix multiplication. That
would also enable using an optimized blas (-fexternal-blas) for
strided arrays. But this is of course nothing that blocks this patch,
just something that might be worth looking into in the future.
Committed after approval on bugzilla to eliminate warnings. 2016-11-16 Jerry DeLisle <jvdelisle@gcc.gnu.org> PR libgfortran/51119 * Makefile.am: Remove -fno-protect-parens -fstack-arrays. * Makefile.in: Regenerate. r242517 = 026291bdda18395d7c746856dd7e4ed384856a1b (refs/remotes/svn/trunk) M libgfortran/Makefile.in M libgfortran/ChangeLog M libgfortran/Makefile.am Regards, Jerry
diff --git a/libgfortran/Makefile.am b/libgfortran/Makefile.am index 39d3e11..9ee17f9 100644 --- a/libgfortran/Makefile.am +++ b/libgfortran/Makefile.am @@ -850,7 +850,7 @@ intrinsics/dprod_r8.f90 \ intrinsics/f2c_specifics.F90 # Turn on vectorization and loop unrolling for matmul. -$(patsubst %.c,%.lo,$(notdir $(i_matmul_c))): AM_CFLAGS += -ftree-vectorize -funroll-loops +$(patsubst %.c,%.lo,$(notdir $(i_matmul_c))): AM_CFLAGS += -ffast-math -fno-protect-parens -fstack-arrays -ftree-vectorize -funroll-loops --param max-unroll-times=4 -ftree-loop-vectorize # Logical matmul doesn't vectorize. $(patsubst %.c,%.lo,$(notdir $(i_matmull_c))): AM_CFLAGS += -funroll-loops