diff mbox

[fortran,RFC] First steps towards inlining matmul

Message ID 55212B5D.6010309@netcologne.de
State New
Headers show

Commit Message

Thomas Koenig April 5, 2015, 12:32 p.m. UTC
Hello world,

this is a first draft of a patch to inline matmul (PR 37171). This is
preliminary, but functional as far as it goes.  Definitely for the
next stage one :-)

Basically, it takes

c = matmul(a,b)

and converts this into

   BLOCK
     integer i,j,k
     c = 0
     do j=0, size(b,2)-1
       do k=0, size(a, 2)-1
         do i=0, size(a, 1)-1
            c(i * stride(c,1) + lbound(c,1), j * stride(c,2) +
lbound(c,2)) =
	    c(i * stride(c,1) + lbound(c,1), j * stride(c,2) + lbound(c,2)) +
            a(i * stride(a,1) + lbound(a,1), k * stride(a,2) +
lbound(a,2)) *
            b(k * stride(b,1) + lbound(b,1), j * stride(b,2) + lbound(b,2))
         end do
       end do
     end do
   END BLOCK

For this, I wrote something like a scalarizer that only operates on the
AST only.  I find it rather straigtforward to use (in contrast to the
one we have), but that may also be due to the fact that I wrote it
myself :-)

I chose zero-based loop variables because I didn't want to special-case
too many strange array bounds, and I trusted the induction variable
optimization and CSE in the middle end and in the RTL optimization.
If this approach suboptimizes something, let me know.

References are frozen to make sure that no additional evaluations are
done on.

The patch to simplify.c is not strictly necessary, but it should open
up some optimization possiblities and, frankly, makes the
-fdump-fortran-optimized output at least halfway readable.

This patch needs to be extended in different directions:

- Currently, it only handles the case of two-dimensional arguments

- Special cases like transpose(a) and spread(a) in the arguments
  arguments should be added, possibly with a reordering of loops
  and addition of some more variables.

- Temporary handling for arguments and results, which does not depend
  on allocatable RHS.  I am thinking of creating nested blocks,
  where the inner block has arrays with the bounds of the
  references.

- Cases like matmul(a+1,b) can also be handled in the loop,
  without a temporary

- This needs command-line arguments, -finline-matmul and
  -finline-matmul-limit=, similar to the handling of BLAS.

- Bounds checking could/should also be handled.  Currently, this
  causes the only regression because of a different error message.
  We need to call gfortran_runtime_error directly from the front end,
  and to be able to tell the bounds-checking code "do not worry about
  bounds checking this, it has already been taken care of".

- I already have some test cases, but for this, I would really like
  to cover all code paths.  This also needs some more work.

So, what do you think about this?

	Thomas

2015-04-05  Thomas Koenig  <tkoenig@gcc.gnu.org>

        * frontend-passes.c (create_var):  Add optional argument
        vname as part of the name.  Split off block creation into
        (insert_block): New function.
        (cfe_expr_0):  Use "fcn" as part of tempoary name.
        (optimize_namesapce):  Call optimize_matmul_assign.
        (combine_array_constructor):  Use "constr" as part of
        temporary name.
        (get_array_inq_function):  New function.
        (is_function_or_op):  New function.
        (has_function_or_op):  New function.
        (freeze_expr):  New function.
        (freeze_references):  New function.
        (convert_to_index_kind):  New function.
        (create_do_loop):  New function.
        (get_operand):  New function.
        (get_size_1):  New function.
        (scalarize_expr):  New function.
        (optimize_matmul_assign): New function.
        * simplify.c (simplify_bound):  Simplify the case of the
        lower bound of an assumed-shape argument.

Comments

Jerry DeLisle April 5, 2015, 1:20 p.m. UTC | #1
On 04/05/2015 05:32 AM, Thomas Koenig wrote:
--- snip ---
>
> So, what do you think about this?
>
> 	Thomas

I am curious about what performance gain results from this? I can see saving a 
library call to our runtime libraries.  Do you have some timing results?

Jerry
Thomas Koenig April 5, 2015, 1:48 p.m. UTC | #2
Hi Jerry,

> I am curious about what performance gain results from this? I can see
> saving a library call to our runtime libraries.  Do you have some timing
> results?

The speedup can be quite drastic for small matrices which can be
completely unrolled by -O3:

b1.f90:

program main
  use b2
  implicit none
  real, dimension(3,3) :: a, b, c
  integer :: i

  call random_number(a)
  call random_number(b)
  do i=1,10**8
     c = matmul(a,b)
     call bar(b,c)
  end do
end program main

b2.f90:

module b2
contains
  subroutine bar(b,c)
    real, dimension(3,3) :: b,c
  end subroutine bar
end module b2

ig25@linux-fd1f:~/Krempel/Matmul> gfortran -O3 -fno-frontend-optimize
b2.f90 b1.f90 && time ./a.out

real    0m15.411s
user    0m15.404s
sys     0m0.001s

ig25@linux-fd1f:~/Krempel/Matmul> gfortran -O3 b2.f90 b1.f90 && time ./a.out

real    0m1.736s
user    0m1.735s
sys     0m0.001s
David Malcolm April 7, 2015, 7:17 p.m. UTC | #3
On Sun, 2015-04-05 at 14:32 +0200, Thomas Koenig wrote:
> Hello world,
> 
> this is a first draft of a patch to inline matmul (PR 37171). This is

(FWIW, the above PR# looks like it should be PR 37131)
diff mbox

Patch

Index: simplify.c
===================================================================
--- simplify.c	(Revision 221757)
+++ simplify.c	(Arbeitskopie)
@@ -3445,6 +3445,25 @@  simplify_bound (gfc_expr *array, gfc_expr *dim, gf
 
  done:
 
+  if (!upper && as->type == AS_ASSUMED_SHAPE && dim)
+    {
+      if (dim->expr_type == EXPR_CONSTANT)
+	{
+	  unsigned long int ndim;
+	  gfc_expr *lower, *res;
+
+	  ndim = mpz_get_si (dim->value.integer) - 1;
+	  lower = as->lower[ndim];
+	  if (lower->expr_type == EXPR_CONSTANT)
+	    {
+	      int nkind = mpz_get_si (kind->value.integer);
+	      res = gfc_copy_expr (lower);
+	      res->ts.kind = nkind;
+	      return res;
+	    }
+	}
+    }
+
   if (as && (as->type == AS_DEFERRED || as->type == AS_ASSUMED_SHAPE
 	     || as->type == AS_ASSUMED_RANK))
     return NULL;