diff mbox

[fortran,RFC] First steps towards inlining matmul

Message ID 4DA1047C-92D3-485A-9457-61655ED14681@lps.ens.fr
State New
Headers show

Commit Message

Dominique d'Humières April 5, 2015, 6:25 p.m. UTC
I have done some timings

(1) with the test given below, before the patch I get (last column in Gflops)

[Book15] f90/bug% gfc -Ofast timing/matmul_tst_sys.f90 -framework Accelerate
[Book15] f90/bug% time a.out
 Time, MATMUL:    373.708008       373.69497100000001        4.2815668504139435     
 Time, MATMUL:    172.086609       23.117919000000001        69.210381782201068     
545.374u 0.537s 6:36.93 137.5%	0+0k 0+0io 0pf+0w
[Book15] f90/bug% gfc -Ofast -fexternal-blas timing/matmul_tst_sys.f90  -framework Accelerate
[Book15] f90/bug% time a.out
 Time, MATMUL:    176.327881       23.855111999999998        67.071577781735002     
 Time, MATMUL:    182.086746       24.453551000000001        65.430170039516952     
358.059u 0.471s 0:48.43 740.2%	0+0k 0+0io 0pf+0w

after the patch

[Book15] f90/bug% time a.out
 Time, MATMUL:    392.415436       392.41728799999999        4.0772923337669056     
 Time, MATMUL:    171.690399       22.905118000000002        69.853383859450091     
563.671u 0.551s 6:55.44 135.8%	0+0k 0+0io 0pf+0w
[Book15] f90/bug% gfc -Ofast -fexternal-blas timing/matmul_tst_sys.f90 -framework Accelerate
[Book15] f90/bug% time a.out
 Time, MATMUL:    392.850342       392.84190799999999        4.0728852177349673     
 Time, MATMUL:    174.534821       23.784797000000001        67.269861500184334     
566.824u 0.674s 6:56.74 136.1%	0+0k 0+0io 0pf+0w

which means that -fexternal-blas should disable the inlining.

program t2 
implicit none 
REAL time_begin, time_end 
integer, parameter :: n=2000; 
integer(8) :: ts, te, rate8, cmax8
real(8) :: elapsed
REAL(8) :: a(n,n), b(n,n), c(n,n) 
integer, parameter :: m = 100 
integer :: i 
call RANDOM_NUMBER(a) 
call RANDOM_NUMBER(b) 
call cpu_time(time_begin) 
call SYSTEM_CLOCK (ts, rate8, cmax8)
do i = 1,m 
    a(1,1) = a(1,1) + 0.1 
    c = MATMUL(a,b) 
enddo 
call SYSTEM_CLOCK (te, rate8, cmax8)
call cpu_time(time_end) 
elapsed = real(te-ts, kind=8)/real(rate8, kind=8)
PRINT *, 'Time, MATMUL: ',time_end-time_begin, elapsed , 2*m*real(n, kind=8)**3/(10**9*elapsed)
call cpu_time(time_begin) 
call SYSTEM_CLOCK (ts, rate8, cmax8)
do i = 1,m 
    a(1,1) = a(1,1) + 0.1 
    call dgemm('n','n',n, n, n, dble(1.0), a, n, b, n, dble(0.0), c, n) 
enddo 
call SYSTEM_CLOCK (te, rate8, cmax8)
call cpu_time(time_end) 
elapsed = real(te-ts, kind=8)/real(rate8, kind=8)
PRINT *, 'Time, MATMUL: ',time_end-time_begin, elapsed , 2*m*real(n, kind=8)**3/(10**9*elapsed)
end program 
! http://groups.google.com/group/comp.lang.fortran/browse_thread/thread/1cba8e6ce5080197

(2) The polyhedron test compiled with -fprotect-parens -Ofast -funroll-loops -ftree-loop-linear -fomit-frame-pointer -fwhole-program -flto runs in ~6.5s. After applying the patch below, it runs in ~66s with/without the patch.

Dominique


> Le 5 avr. 2015 à 16:33, Thomas Koenig <tkoenig@netcologne.de> a écrit :
> 
> Hi Dominique,
> 
>> IMO the inlining of MATMUL should be restricted to small matrices (less than 4x4, 9x9
>> or 16x16 depending of your field!-)
> 
> The problem with the library function we have is that it is quite
> general; it can deal with all the complexity of assumed-shape array
> arguments.  Inlining allows the compiler to take advantage of
> contiguous memory, compile-time knowledge of array sizes, etc.
> to allow for vectorization or even complete unrolling.
> 
> Of course, if you use c=matmul(a,b) where all three are assumed-shape
> arrays, your advantage over the library function will be minimal at
> best.
> 
> It will be interesting to see at what size calling BLAS directly
> will be better.
> 
> 	Thomas

Comments

Thomas Koenig April 5, 2015, 8:37 p.m. UTC | #1
Hi Dominique,

> which means that -fexternal-blas should disable the inlining.

It is not surprising that a higly tuned BLAS library is better than
a simple inlining for large matrices.

I did some tests by adjusting n; it seems the inline version is
faster for n<=22, which is not too bad.

Regarding your other test case:  This tests matrix*vector
multiplication, which is not implemented yet :-)

Regards,

	Thomas
Dominique d'Humières April 5, 2015, 11:15 p.m. UTC | #2
The patch causes the following regressions:

FAIL: gfortran.dg/coarray/dummy_1.f90 -fcoarray=single  -O2  -latomic (internal compiler error)
FAIL: gfortran.dg/coarray/dummy_1.f90 -fcoarray=single  -O2  -latomic (test for excess errors)
FAIL: gfortran.dg/coarray/dummy_1.f90 -fcoarray=lib  -O2  -lcaf_single -latomic (internal compiler error)
FAIL: gfortran.dg/coarray/dummy_1.f90 -fcoarray=lib  -O2  -lcaf_single -latomic (test for excess errors)
FAIL: gfortran.dg/array_function_3.f90   -O  (internal compiler error)
FAIL: gfortran.dg/array_function_3.f90   -O  (test for excess errors)
FAIL: gfortran.dg/bind_c_vars.f90   -g -flto  (test for excess errors)
FAIL: gfortran.dg/bound_2.f90   -O0  (internal compiler error)
FAIL: gfortran.dg/bound_2.f90   -O0  (test for excess errors)
FAIL: gfortran.dg/bound_2.f90   -O1  (internal compiler error)
FAIL: gfortran.dg/bound_2.f90   -O1  (test for excess errors)
FAIL: gfortran.dg/bound_2.f90   -O2  (internal compiler error)
FAIL: gfortran.dg/bound_2.f90   -O2  (test for excess errors)
FAIL: gfortran.dg/bound_2.f90   -O3 -fomit-frame-pointer  (internal compiler error)
FAIL: gfortran.dg/bound_2.f90   -O3 -fomit-frame-pointer  (test for excess errors)
FAIL: gfortran.dg/bound_2.f90   -O3 -fomit-frame-pointer -funroll-loops  (internal compiler error)
FAIL: gfortran.dg/bound_2.f90   -O3 -fomit-frame-pointer -funroll-loops  (test for excess errors)
FAIL: gfortran.dg/bound_2.f90   -O3 -fomit-frame-pointer -funroll-all-loops -finline-functions  (internal compiler error)
FAIL: gfortran.dg/bound_2.f90   -O3 -fomit-frame-pointer -funroll-all-loops -finline-functions  (test for excess errors)
FAIL: gfortran.dg/bound_2.f90   -O3 -g  (internal compiler error)
FAIL: gfortran.dg/bound_2.f90   -O3 -g  (test for excess errors)
FAIL: gfortran.dg/bound_2.f90   -Os  (internal compiler error)
FAIL: gfortran.dg/bound_2.f90   -Os  (test for excess errors)
FAIL: gfortran.dg/bound_2.f90   -g -flto  (internal compiler error)
FAIL: gfortran.dg/bound_7.f90   -O0  (test for excess errors)
FAIL: gfortran.dg/bound_7.f90   -O1  (internal compiler error)
FAIL: gfortran.dg/bound_7.f90   -O1  (test for excess errors)
FAIL: gfortran.dg/bound_7.f90   -O2  (internal compiler error)
FAIL: gfortran.dg/bound_7.f90   -O2  (test for excess errors)
FAIL: gfortran.dg/bound_7.f90   -O3 -fomit-frame-pointer  (internal compiler error)
FAIL: gfortran.dg/bound_7.f90   -O3 -fomit-frame-pointer  (test for excess errors)
FAIL: gfortran.dg/bound_7.f90   -O3 -fomit-frame-pointer -funroll-loops  (internal compiler error)
FAIL: gfortran.dg/bound_7.f90   -O3 -fomit-frame-pointer -funroll-loops  (test for excess errors)
FAIL: gfortran.dg/bound_7.f90   -O3 -fomit-frame-pointer -funroll-all-loops -finline-functions  (internal compiler error)
FAIL: gfortran.dg/bound_7.f90   -O3 -fomit-frame-pointer -funroll-all-loops -finline-functions  (test for excess errors)
FAIL: gfortran.dg/bound_7.f90   -O3 -g  (internal compiler error)
FAIL: gfortran.dg/bound_7.f90   -O3 -g  (test for excess errors)
FAIL: gfortran.dg/bound_7.f90   -Os  (internal compiler error)
FAIL: gfortran.dg/bound_7.f90   -Os  (test for excess errors)
FAIL: gfortran.dg/bound_7.f90   -g -flto  (internal compiler error)
FAIL: gfortran.dg/bound_7.f90   -g -flto  (test for excess errors)
FAIL: gfortran.dg/bound_8.f90   -O0  (internal compiler error)
FAIL: gfortran.dg/bound_8.f90   -O0  (test for excess errors)
FAIL: gfortran.dg/bound_8.f90   -O1  (internal compiler error)
FAIL: gfortran.dg/bound_8.f90   -O1  (test for excess errors)
FAIL: gfortran.dg/bound_8.f90   -O2  (internal compiler error)
FAIL: gfortran.dg/bound_8.f90   -O2  (test for excess errors)
FAIL: gfortran.dg/bound_8.f90   -O3 -fomit-frame-pointer  (internal compiler error)
FAIL: gfortran.dg/bound_8.f90   -O3 -fomit-frame-pointer  (test for excess errors)
FAIL: gfortran.dg/bound_8.f90   -O3 -fomit-frame-pointer -funroll-loops  (internal compiler error)
FAIL: gfortran.dg/bound_8.f90   -O3 -fomit-frame-pointer -funroll-loops  (test for excess errors)
FAIL: gfortran.dg/bound_8.f90   -O3 -fomit-frame-pointer -funroll-all-loops -finline-functions  (internal compiler error)
FAIL: gfortran.dg/bound_8.f90   -O3 -fomit-frame-pointer -funroll-all-loops -finline-functions  (test for excess errors)
FAIL: gfortran.dg/bound_8.f90   -O3 -g  (internal compiler error)
FAIL: gfortran.dg/bound_8.f90   -O3 -g  (test for excess errors)
FAIL: gfortran.dg/bound_8.f90   -Os  (internal compiler error)
FAIL: gfortran.dg/bound_8.f90   -Os  (test for excess errors)
FAIL: gfortran.dg/bound_8.f90   -g -flto  (internal compiler error)
FAIL: gfortran.dg/bound_8.f90   -g -flto  (test for excess errors)

I think the line

  if (!upper && as->type == AS_ASSUMED_SHAPE && dim)

should be something such as (untested)

  if (!upper && dim && as && as->type == AS_ASSUMED_SHAPE)

Dominique

> Le 5 avr. 2015 à 22:37, Thomas Koenig <tkoenig@netcologne.de> a écrit :
> 
> Hi Dominique,
> 
>> which means that -fexternal-blas should disable the inlining.
> 
> It is not surprising that a higly tuned BLAS library is better than
> a simple inlining for large matrices.
> 
> I did some tests by adjusting n; it seems the inline version is
> faster for n<=22, which is not too bad.
> 
> Regarding your other test case:  This tests matrix*vector
> multiplication, which is not implemented yet :-)
> 
> Regards,
> 
> 	Thomas
Dominique d'Humières April 6, 2015, 12:18 p.m. UTC | #3
> Le 6 avr. 2015 à 01:15, Dominique d'Humières <dominiq@lps.ens.fr> a écrit :
> 
> The patch causes the following regressions:
> 
> FAIL: gfortran.dg/coarray/dummy_1.f90 -fcoarray=single  -O2  -latomic (internal compiler error)
> …
> FAIL: gfortran.dg/bound_8.f90   -g -flto  (test for excess errors)
> 
> I think the line
> 
>  if (!upper && as->type == AS_ASSUMED_SHAPE && dim)
> 
> should be something such as (untested)
> 
>  if (!upper && dim && as && as->type == AS_ASSUMED_SHAPE)

This fixes a first batch of ICEs. A second one if fixed by

          if (kind && lower->expr_type == EXPR_CONSTANT)

While the first change is obvious, I am not sure about the second one.

With this changes I am left with the following regressions

FAIL: gfortran.dg/function_optimize_1.f90   -O   scan-tree-dump-times original "matmul_r4" 1
FAIL: gfortran.dg/function_optimize_7.f90   -O   scan-tree-dump-times original "matmul_r4" 1
FAIL: gfortran.dg/function_optimize_2.f90   -O   scan-tree-dump-times original "matmul_r4" 1
FAIL: gfortran.dg/matmul_bounds_2.f90   -O1  output pattern test
FAIL: gfortran.dg/matmul_bounds_2.f90   -O2  output pattern test
FAIL: gfortran.dg/matmul_bounds_2.f90   -O3 -fomit-frame-pointer  output pattern test
FAIL: gfortran.dg/matmul_bounds_2.f90   -O3 -fomit-frame-pointer -funroll-loops  output pattern test
FAIL: gfortran.dg/matmul_bounds_2.f90   -O3 -fomit-frame-pointer -funroll-all-loops -finline-functions  output pattern test
FAIL: gfortran.dg/matmul_bounds_2.f90   -O3 -g  output pattern test
FAIL: gfortran.dg/matmul_bounds_2.f90   -Os  output pattern test
FAIL: gfortran.dg/matmul_bounds_3.f90   -O1  output pattern test
FAIL: gfortran.dg/matmul_bounds_3.f90   -O2  output pattern test
FAIL: gfortran.dg/matmul_bounds_3.f90   -O3 -fomit-frame-pointer  output pattern test
FAIL: gfortran.dg/matmul_bounds_3.f90   -O3 -fomit-frame-pointer -funroll-loops  output pattern test
FAIL: gfortran.dg/matmul_bounds_3.f90   -O3 -fomit-frame-pointer -funroll-all-loops -finline-functions  output pattern test
FAIL: gfortran.dg/matmul_bounds_3.f90   -O3 -g  output pattern test
FAIL: gfortran.dg/matmul_bounds_3.f90   -Os  output pattern test
FAIL: gfortran.dg/realloc_on_assign_11.f90   -O3 -fomit-frame-pointer  execution test
FAIL: gfortran.dg/realloc_on_assign_11.f90   -O3 -fomit-frame-pointer -funroll-loops  execution test
FAIL: gfortran.dg/realloc_on_assign_11.f90   -O3 -fomit-frame-pointer -funroll-all-loops -finline-functions  execution test
FAIL: gfortran.dg/realloc_on_assign_11.f90   -O3 -g  execution test
FAIL: gfortran.dg/realloc_on_assign_11.f90   -Os  execution test
FAIL: gfortran.dg/realloc_on_assign_7.f03   -O1  execution test
FAIL: gfortran.dg/realloc_on_assign_7.f03   -O2  execution test
FAIL: gfortran.dg/realloc_on_assign_7.f03   -O3 -fomit-frame-pointer  execution test
FAIL: gfortran.dg/realloc_on_assign_7.f03   -O3 -fomit-frame-pointer -funroll-loops  execution test
FAIL: gfortran.dg/realloc_on_assign_7.f03   -O3 -fomit-frame-pointer -funroll-all-loops -finline-functions  execution test
FAIL: gfortran.dg/realloc_on_assign_7.f03   -O3 -g  execution test
FAIL: gfortran.dg/realloc_on_assign_7.f03   -Os  execution test

The gfortran.dg/matmul_bounds_* are failures due to a change in the run time error (is there a good reason for that?) and the gfortran.dg/realloc_on_assign_* failures are due to segfaults at run time.

Dominique

>> Le 5 avr. 2015 à 22:37, Thomas Koenig <tkoenig@netcologne.de> a écrit :
>> 
>> Hi Dominique,
>> 
>>> which means that -fexternal-blas should disable the inlining.
>> 
>> It is not surprising that a higly tuned BLAS library is better than
>> a simple inlining for large matrices.
>> 
>> I did some tests by adjusting n; it seems the inline version is
>> faster for n<=22, which is not too bad.
>> 
>> Regarding your other test case:  This tests matrix*vector
>> multiplication, which is not implemented yet :-)
>> 
>> Regards,
>> 
>> 	Thomas
>
diff mbox

Patch

--- induct.f90	2005-10-11 22:53:32.000000000 +0200
+++ induct_vmc.f90	2015-04-05 19:06:30.000000000 +0200
@@ -1644,18 +1644,17 @@  contains
           coil_tmp_vector(1) = -sin(theta)
           coil_tmp_vector(2) = cos(theta)
           coil_tmp_vector(3) = 0.0_longreal
-          coil_current_vec(1) = dot_product(rotate_coil(1,:),coil_tmp_vector(:))
-          coil_current_vec(2) = dot_product(rotate_coil(2,:),coil_tmp_vector(:))
-          coil_current_vec(3) = dot_product(rotate_coil(3,:),coil_tmp_vector(:))
+          coil_current_vec = matmul(rotate_coil,coil_tmp_vector)
 !
           do j = 1, 9
               c_vector(3) = 0.5 * h_coil * z1gauss(j)
 !
 !       rotate coil vector into the global coordinate system and translate it
 !
-              rot_c_vector(1) = dot_product(rotate_coil(1,:),c_vector(:)) + dx
-              rot_c_vector(2) = dot_product(rotate_coil(2,:),c_vector(:)) + dy
-              rot_c_vector(3) = dot_product(rotate_coil(3,:),c_vector(:)) + dz
+              rot_c_vector = matmul(rotate_coil,c_vector)
+              rot_c_vector(1) = rot_c_vector(1) + dx
+              rot_c_vector(2) = rot_c_vector(2) + dy
+              rot_c_vector(3) = rot_c_vector(3) + dz
 !
               do k = 1, 9
                   q_vector(1) = 0.5_longreal * a * (x2gauss(k) + 1.0_longreal)
@@ -1664,9 +1663,7 @@  contains
 !
 !       rotate quad vector into the global coordinate system
 !
-                  rot_q_vector(1) = dot_product(rotate_quad(1,:),q_vector(:))
-                  rot_q_vector(2) = dot_product(rotate_quad(2,:),q_vector(:))
-                  rot_q_vector(3) = dot_product(rotate_quad(3,:),q_vector(:))
+                  rot_q_vector = matmul(rotate_quad,q_vector)
 !
 !       compute and add in quadrature term
 !
@@ -1756,18 +1753,17 @@  contains
           coil_tmp_vector(1) = -sin(theta)
           coil_tmp_vector(2) = cos(theta)
           coil_tmp_vector(3) = 0.0_longreal
-          coil_current_vec(1) = dot_product(rotate_coil(1,:),coil_tmp_vector(:))
-          coil_current_vec(2) = dot_product(rotate_coil(2,:),coil_tmp_vector(:))
-          coil_current_vec(3) = dot_product(rotate_coil(3,:),coil_tmp_vector(:))
+          coil_current_vec = matmul(rotate_coil,coil_tmp_vector)
 !
           do j = 1, 9
               c_vector(3) = 0.5 * h_coil * z1gauss(j)
 !
 !       rotate coil vector into the global coordinate system and translate it
 !
-              rot_c_vector(1) = dot_product(rotate_coil(1,:),c_vector(:)) + dx
-              rot_c_vector(2) = dot_product(rotate_coil(2,:),c_vector(:)) + dy
-              rot_c_vector(3) = dot_product(rotate_coil(3,:),c_vector(:)) + dz
+              rot_c_vector = matmul(rotate_coil,c_vector)
+              rot_c_vector(1) = rot_c_vector(1) + dx
+              rot_c_vector(2) = rot_c_vector(2) + dy
+              rot_c_vector(3) = rot_c_vector(3) + dz
 !
               do k = 1, 9
                   q_vector(1) = 0.5_longreal * a * (x2gauss(k) + 1.0_longreal)
@@ -1776,9 +1772,7 @@  contains
 !
 !       rotate quad vector into the global coordinate system
 !
-                  rot_q_vector(1) = dot_product(rotate_quad(1,:),q_vector(:))
-                  rot_q_vector(2) = dot_product(rotate_quad(2,:),q_vector(:))
-                  rot_q_vector(3) = dot_product(rotate_quad(3,:),q_vector(:))
+                  rot_q_vector = matmul(rotate_quad,q_vector)
 !
 !       compute and add in quadrature term
 !
@@ -2061,18 +2055,17 @@  contains
 !
 !       compute current vector for the coil in the global coordinate system
 !
-          coil_current_vec(1) = dot_product(rotate_coil(1,:),coil_tmp_vector(:))
-          coil_current_vec(2) = dot_product(rotate_coil(2,:),coil_tmp_vector(:))
-          coil_current_vec(3) = dot_product(rotate_coil(3,:),coil_tmp_vector(:))
+          coil_current_vec = matmul(rotate_coil,coil_tmp_vector)
 !
           do j = 1, 9
               c_vector(3) = 0.5 * h_coil * z1gauss(j)
 !
 !       rotate coil vector into the global coordinate system and translate it
 !
-              rot_c_vector(1) = dot_product(rotate_coil(1,:),c_vector(:)) + dx
-              rot_c_vector(2) = dot_product(rotate_coil(2,:),c_vector(:)) + dy
-              rot_c_vector(3) = dot_product(rotate_coil(3,:),c_vector(:)) + dz
+              rot_c_vector = matmul(rotate_coil,c_vector)
+              rot_c_vector(1) = rot_c_vector(1) + dx
+              rot_c_vector(2) = rot_c_vector(2) + dy
+              rot_c_vector(3) = rot_c_vector(3) + dz
 !
               do k = 1, 9
                   q_vector(1) = 0.5_longreal * a * (x2gauss(k) + 1.0_longreal)
@@ -2081,9 +2074,7 @@  contains
 !
 !       rotate quad vector into the global coordinate system
 !
-                  rot_q_vector(1) = dot_product(rotate_quad(1,:),q_vector(:))
-                  rot_q_vector(2) = dot_product(rotate_quad(2,:),q_vector(:))
-                  rot_q_vector(3) = dot_product(rotate_quad(3,:),q_vector(:))
+                  rot_q_vector = matmul(rotate_quad,q_vector)
 !
 !       compute and add in quadrature term
 !
@@ -2204,18 +2195,17 @@  contains
 !
 !       compute current vector for the coil in the global coordinate system
 !
-          coil_current_vec(1) = dot_product(rotate_coil(1,:),coil_tmp_vector(:))
-          coil_current_vec(2) = dot_product(rotate_coil(2,:),coil_tmp_vector(:))
-          coil_current_vec(3) = dot_product(rotate_coil(3,:),coil_tmp_vector(:))
+          coil_current_vec = matmul(rotate_coil,coil_tmp_vector)
 !
           do j = 1, 9
               c_vector(3) = 0.5 * h_coil * z1gauss(j)
 !
 !       rotate coil vector into the global coordinate system and translate it
 !
-              rot_c_vector(1) = dot_product(rotate_coil(1,:),c_vector(:)) + dx
-              rot_c_vector(2) = dot_product(rotate_coil(2,:),c_vector(:)) + dy
-              rot_c_vector(3) = dot_product(rotate_coil(3,:),c_vector(:)) + dz
+              rot_c_vector = matmul(rotate_coil,c_vector)
+              rot_c_vector(1) = rot_c_vector(1) + dx
+              rot_c_vector(2) = rot_c_vector(2) + dy
+              rot_c_vector(3) = rot_c_vector(3) + dz
 !
               do k = 1, 9
                   q_vector(1) = 0.5_longreal * a * (x2gauss(k) + 1.0_longreal)
@@ -2224,9 +2214,7 @@  contains
 !
 !       rotate quad vector into the global coordinate system
 !
-                  rot_q_vector(1) = dot_product(rotate_quad(1,:),q_vector(:))
-                  rot_q_vector(2) = dot_product(rotate_quad(2,:),q_vector(:))
-                  rot_q_vector(3) = dot_product(rotate_quad(3,:),q_vector(:))
+                  rot_q_vector = matmul(rotate_quad,q_vector)
 !
 !       compute and add in quadrature term
 !