[Fortran,v2] Use MIN/MAX_EXPR for min/max intrinsics

Message ID	5B4F21E0.3060307@foss.arm.com
State	New
Headers	Message-ID: <5B4F21E0.3060307@foss.arm.com> Date: Wed, 18 Jul 2018 12:17:52 +0100 From: Kyrill Tkachov <kyrylo.tkachov@foss.arm.com> To: Janne Blomqvist <blomqvist.janne@gmail.com>, Thomas Koenig <tkoenig@netcologne.de> CC: Richard Biener <richard.guenther@gmail.com>, "fortran@gcc.gnu.org" <fortran@gcc.gnu.org>, GCC Patches <gcc-patches@gcc.gnu.org> Subject: [PATCH][Fortran][v2] Use MIN/MAX_EXPR for min/max intrinsics
Series	[Fortran,v2] Use MIN/MAX_EXPR for min/max intrinsics

Kyrill Tkachov July 18, 2018, 11:17 a.m. UTC

Hi all,

Thank you for the feedback so far.
This version of the patch doesn't try to emit fmin/fmax function calls but instead
emits MIN/MAX_EXPR sequences unconditionally.
I think a source of confusion in the original proposal (for me at least) was
that on aarch64 (which I primarily work on) we implement the fmin/fmax optabs,
so these calls are expanded to a single instruction.
But on x86_64 these optabs are not implemented, so the calls expand to actual library calls.
Therefore at -O3 (no -ffast-math) I saw a gain on aarch64. But when I measured today
on x86_64 I saw a regression.

Thomas and Janne suggested that the Fortran standard does not impose a requirement
on NaN handling for the min/max intrinsics, which would make emitting MIN/MAX_EXPR
sequences unconditionally a valid approach.

However, the gfortran.dg/nan_1.f90 test checks that handling of NaN values in
these intrinsics follows the IEEE semantics (min (nan, 2.0) == 2.0, for example).
This is not required by the standard, but is the existing gfortran behaviour.

If we end up always emitting MIN/MAX_EXPR sequences, like this version of the patch does,
then that test fails on some configurations of x86_64 and not others (for me it FAILs
by default, but passes with -march=native on my machine) and passes on AArch64.
This is expected since MIN/MAX_EXPR doesn't enforce IEEE behaviour on its arguments.

However, by always emitting MIN/MAX_EXPR the gfc_conv_intrinsic_minmax function is
simplified and, perhaps more importantly, generates faster code in the -O3 case.
With this patch I see performance improvement on 521.wrf on both AArch64 (3.7%)
and x86_64 (5.4%).

Thomas, Janne, would this relaxation of NaN handling be acceptable given the benefits
mentioned above? If so, what would be the recommended adjustment to the nan_1.f90 test?

Thanks,
Kyrill

2018-07-18  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>

     * trans-intrinsic.c (gfc_conv_intrinsic_minmax): Emit MIN/MAX_EXPR
     sequence to calculate the min/max.

2018-07-18  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>

     * gfortran.dg/max_float.f90: New test.
     * gfortran.dg/min_float.f90: Likewise.
     * gfortran.dg/minmax_integer.f90: Likewise.

Thomas König July 18, 2018, 1:26 p.m. UTC | #1

Hi Kyrill,

> Am 18.07.2018 um 13:17 schrieb Kyrill Tkachov <kyrylo.tkachov@foss.arm.com>:
> 
> Thomas, Janne, would this relaxation of NaN handling be acceptable given the benefits
> mentioned above? If so, what would be the recommended adjustment to the nan_1.f90 test?

I would be a bit careful about changing behavior in such a major way. What would the results with NaN and infinity then be, with or without optimization? Would the results be consistent with min(nan,num) vs min(num,nan)? Would they be consistent with the new IEEE standard?

In general, I think that min(nan,num) should be nan and that our current behavior is not the best.

Does anybody have data points on how this is handled by other compilers?

Oh, and if anything is changed, then compile and runtime behavior should always be the same.

Regards, Thomas

Kyrill Tkachov July 18, 2018, 2:03 p.m. UTC | #2

On 18/07/18 14:26, Thomas König wrote:
> Hi Kyrill,
>
>> Am 18.07.2018 um 13:17 schrieb Kyrill Tkachov <kyrylo.tkachov@foss.arm.com>:
>>
>> Thomas, Janne, would this relaxation of NaN handling be acceptable given the benefits
>> mentioned above? If so, what would be the recommended adjustment to the nan_1.f90 test?
> I would be a bit careful about changing behavior in such a major way. What would the results with NaN and infinity then be, with or without optimization? Would the results be consistent with min(nan,num) vs min(num,nan)? Would they be consistent with the new IEEE standard?
>
> In general, I think that min(nan,num) should be nan and that our current behavior is not the best.
>
> Does anybody have data points on how this is handled by other compilers?
>
> Oh, and if anything is changed, then compile and runtime behavior should always be the same.

Thanks, that makes it clearer what behaviour is acceptable.

So this v3 patch follows Richard Sandiford's suggested approach of emitting IFN_FMIN/FMAX
when the values are floating-point, NaN handling matters, and the target
supports IFN_FMIN/FMAX. Otherwise the current explicit comparison sequence is emitted.
For integer types and -ffast-math floating-point it emits MIN/MAX_EXPR.

With this patch the nan_1.f90 behaviour is preserved on all targets: we get the optimal
sequence on aarch64, and on x86_64 we avoid the function call, leaving code generation there unchanged.

This gives the performance improvement on 521.wrf on aarch64 and leaves it unchanged on x86_64.

I'm hoping this addresses all the concerns raised in this thread:
* The NaN-handling behaviour is unchanged on all platforms.
* The fast inline sequence is emitted where it is available.
* No calls to library fmin*/fmax* are emitted where there were none.
* MIN/MAX_EXPR sequences are emitted where possible.
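
The difference between the explicit comparison sequence and a minimum that ignores the NaN question can be sketched in plain C. This is only an illustration under stated assumptions: `min_explicit` and `min_plain` are hypothetical names, not GCC functions, and `min_plain` merely stands in for code that does not honour NaNs (it is not a claim about what MIN_EXPR expands to). Both assume at least one argument.

```c
#include <math.h>
#include <stddef.h>

/* Hypothetical model of the explicit sequence from the patch comment:
     mvar = a1;
     if (a2 < mvar || isnan (mvar)) mvar = a2;
     ...
   A NaN held in mvar is replaced by the next argument.  */
static double min_explicit(const double *args, size_t n)
{
    double mvar = args[0];
    for (size_t i = 1; i < n; i++)
        if (args[i] < mvar || isnan(mvar))
            mvar = args[i];
    return mvar;
}

/* Hypothetical stand-in for a minimum that does not care about NaNs:
   a bare comparison, so a NaN in mvar is never replaced.  */
static double min_plain(const double *args, size_t n)
{
    double mvar = args[0];
    for (size_t i = 1; i < n; i++)
        if (args[i] < mvar)
            mvar = args[i];
    return mvar;
}
```

For the arguments (NaN, 2.0, 3.0) the first function yields 2.0 while the second yields NaN, which is the kind of difference nan_1.f90 is sensitive to.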

Is this acceptable?

Thanks,
Kyrill

2018-07-18  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>

     * trans-intrinsic.c (gfc_conv_intrinsic_minmax): Emit MIN/MAX_EXPR
     or IFN_FMIN/FMAX sequence to calculate the min/max when possible.

2018-07-18  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>

     * gfortran.dg/max_fmaxl_aarch64.f90: New test.
     * gfortran.dg/min_fminl_aarch64.f90: Likewise.
     * gfortran.dg/minmax_integer.f90: Likewise.
diff --git a/gcc/fortran/trans-intrinsic.c b/gcc/fortran/trans-intrinsic.c
index d306e3a5a6209c1621d91f99ffc366acecd9c3d0..6f5700f2a421d2a735d77c4c4ec0c4c9c058e727 100644
--- a/gcc/fortran/trans-intrinsic.c
+++ b/gcc/fortran/trans-intrinsic.c
@@ -31,6 +31,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "trans.h"
 #include "stringpool.h"
 #include "fold-const.h"
+#include "internal-fn.h"
 #include "tree-nested.h"
 #include "stor-layout.h"
 #include "toplev.h"	/* For rest_of_decl_compilation.  */
@@ -3874,14 +3875,15 @@ gfc_conv_intrinsic_ttynam (gfc_se * se, gfc_expr * expr)
     minmax (a1, a2, a3, ...)
     {
       mvar = a1;
-      if (a2 .op. mvar || isnan (mvar))
-        mvar = a2;
-      if (a3 .op. mvar || isnan (mvar))
-        mvar = a3;
+      mvar = COMP (mvar, a2)
+      mvar = COMP (mvar, a3)
       ...
-      return mvar
+      return mvar;
     }
- */
+    Where COMP is MIN/MAX_EXPR for integral types or when we don't
+    care about NaNs, or IFN_FMIN/MAX when the target has support for
+    fast NaN-honouring min/max.  When neither holds expand a sequence
+    of explicit comparisons.  */
 
 /* TODO: Mismatching types can occur when specific names are used.
    These should be handled during resolution.  */
@@ -3891,7 +3893,6 @@ gfc_conv_intrinsic_minmax (gfc_se * se, gfc_expr * expr, enum tree_code op)
   tree tmp;
   tree mvar;
   tree val;
-  tree thencase;
   tree *args;
   tree type;
   gfc_actual_arglist *argexpr;
@@ -3912,55 +3913,77 @@ gfc_conv_intrinsic_minmax (gfc_se * se, gfc_expr * expr, enum tree_code op)
 
   mvar = gfc_create_var (type, "M");
   gfc_add_modify (&se->pre, mvar, args[0]);
-  for (i = 1, argexpr = argexpr->next; i < nargs; i++)
-    {
-      tree cond, isnan;
 
+  internal_fn ifn = op == GT_EXPR ? IFN_FMAX : IFN_FMIN;
+
+  for (i = 1, argexpr = argexpr->next; i < nargs; i++, argexpr = argexpr->next)
+    {
+      tree cond = NULL_TREE;
       val = args[i];
 
       /* Handle absent optional arguments by ignoring the comparison.  */
       if (argexpr->expr->expr_type == EXPR_VARIABLE
 	  && argexpr->expr->symtree->n.sym->attr.optional
 	  && TREE_CODE (val) == INDIRECT_REF)
-	cond = fold_build2_loc (input_location,
+	{
+	  cond = fold_build2_loc (input_location,
 				NE_EXPR, logical_type_node,
 				TREE_OPERAND (val, 0),
 			build_int_cst (TREE_TYPE (TREE_OPERAND (val, 0)), 0));
-      else
-      {
-	cond = NULL_TREE;
-
+	}
+      else if (!VAR_P (val) && !TREE_CONSTANT (val))
 	/* Only evaluate the argument once.  */
-	if (!VAR_P (val) && !TREE_CONSTANT (val))
-	  val = gfc_evaluate_now (val, &se->pre);
-      }
+	val = gfc_evaluate_now (val, &se->pre);
 
-      thencase = build2_v (MODIFY_EXPR, mvar, convert (type, val));
+      tree calc;
+      /* If we are dealing with integral types or we don't care about NaNs
+	 just do a MIN/MAX_EXPR.  */
+      if (!HONOR_NANS (type) && !HONOR_SIGNED_ZEROS (type))
+	{
+
+	  tree_code code = op == GT_EXPR ? MAX_EXPR : MIN_EXPR;
+	  calc = fold_build2_loc (input_location, code, type,
+				  convert (type, val), mvar);
+	  tmp = build2_v (MODIFY_EXPR, mvar, calc);
 
-      tmp = fold_build2_loc (input_location, op, logical_type_node,
-			     convert (type, val), mvar);
+	}
+      /* If we care about NaNs and we have internal functions available for
+	 fmin/fmax to perform the comparison, use those.  */
+      else if (SCALAR_FLOAT_TYPE_P (type)
+	      && direct_internal_fn_supported_p (ifn, type, OPTIMIZE_FOR_SPEED))
+	{
+	  calc = build_call_expr_internal_loc (input_location, ifn, type,
+				      2, mvar, convert (type, val));
+	  tmp = build2_v (MODIFY_EXPR, mvar, calc);
 
-      /* FIXME: When the IEEE_ARITHMETIC module is implemented, the call to
-	 __builtin_isnan might be made dependent on that module being loaded,
-	 to help performance of programs that don't rely on IEEE semantics.  */
-      if (FLOAT_TYPE_P (TREE_TYPE (mvar)))
+	}
+      /* Otherwise expand to:
+	mvar = a1;
+	if (a2 .op. mvar || isnan (mvar))
+	  mvar = a2;
+	if (a3 .op. mvar || isnan (mvar))
+	  mvar = a3;
+	...  */
+      else
 	{
-	  isnan = build_call_expr_loc (input_location,
-				       builtin_decl_explicit (BUILT_IN_ISNAN),
-				       1, mvar);
+	  tree isnan = build_call_expr_loc (input_location,
+					builtin_decl_explicit (BUILT_IN_ISNAN),
+					1, mvar);
+	  tmp = fold_build2_loc (input_location, op, logical_type_node,
+				 convert (type, val), mvar);
+
 	  tmp = fold_build2_loc (input_location, TRUTH_OR_EXPR,
-				 logical_type_node, tmp,
-				 fold_convert (logical_type_node, isnan));
+				  logical_type_node, tmp,
+				  fold_convert (logical_type_node, isnan));
+	  tmp = build3_v (COND_EXPR, tmp,
+			  build2_v (MODIFY_EXPR, mvar, convert (type, val)),
+			  build_empty_stmt (input_location));
 	}
-      tmp = build3_v (COND_EXPR, tmp, thencase,
-		      build_empty_stmt (input_location));
 
       if (cond != NULL_TREE)
 	tmp = build3_v (COND_EXPR, cond, tmp,
 			build_empty_stmt (input_location));
-
       gfc_add_expr_to_block (&se->pre, tmp);
-      argexpr = argexpr->next;
     }
   se->expr = mvar;
 }
diff --git a/gcc/testsuite/gfortran.dg/max_fmaxl_aarch64.f90 b/gcc/testsuite/gfortran.dg/max_fmaxl_aarch64.f90
new file mode 100644
index 0000000000000000000000000000000000000000..8c8ea063e5d0718dc829c1f5574c5b46040e6786
--- /dev/null
+++ b/gcc/testsuite/gfortran.dg/max_fmaxl_aarch64.f90
@@ -0,0 +1,9 @@
+! { dg-do compile { target aarch64*-*-* } }
+! { dg-options "-O2 -fdump-tree-optimized" }
+
+subroutine fool (a, b, c, d, e, f, g, h)
+  real (kind=16) :: a, b, c, d, e, f, g, h
+  a = max (a, b, c, d, e, f, g, h)
+end subroutine
+
+! { dg-final { scan-tree-dump-times "__builtin_fmaxl " 7 "optimized" } }
diff --git a/gcc/testsuite/gfortran.dg/min_fminl_aarch64.f90 b/gcc/testsuite/gfortran.dg/min_fminl_aarch64.f90
new file mode 100644
index 0000000000000000000000000000000000000000..92368917fb48e0c468a16d080ab3a9ac842e01a7
--- /dev/null
+++ b/gcc/testsuite/gfortran.dg/min_fminl_aarch64.f90
@@ -0,0 +1,9 @@
+! { dg-do compile { target aarch64*-*-* } }
+! { dg-options "-O2 -fdump-tree-optimized" }
+
+subroutine fool (a, b, c, d, e, f, g, h)
+  real (kind=16) :: a, b, c, d, e, f, g, h
+  a = min (a, b, c, d, e, f, g, h)
+end subroutine
+
+! { dg-final { scan-tree-dump-times "__builtin_fminl " 7 "optimized" } }
diff --git a/gcc/testsuite/gfortran.dg/minmax_integer.f90 b/gcc/testsuite/gfortran.dg/minmax_integer.f90
new file mode 100644
index 0000000000000000000000000000000000000000..5b6be38c7055ce4e8620cf75ec7d8a182436b24f
--- /dev/null
+++ b/gcc/testsuite/gfortran.dg/minmax_integer.f90
@@ -0,0 +1,15 @@
+! { dg-do compile }
+! { dg-options "-O2 -fdump-tree-optimized" }
+
+subroutine foo (a, b, c, d, e, f, g, h)
+  integer (kind=4) :: a, b, c, d, e, f, g, h
+  a = min (a, b, c, d, e, f, g, h)
+end subroutine
+
+subroutine foof (a, b, c, d, e, f, g, h)
+  integer (kind=4) :: a, b, c, d, e, f, g, h
+  a = max (a, b, c, d, e, f, g, h)
+end subroutine
+
+! { dg-final { scan-tree-dump-times "MIN_EXPR" 7 "optimized" } }
+! { dg-final { scan-tree-dump-times "MAX_EXPR" 7 "optimized" } }

Janne Blomqvist July 18, 2018, 2:55 p.m. UTC | #3

On Wed, Jul 18, 2018 at 5:03 PM, Kyrill Tkachov <kyrylo.tkachov@foss.arm.com
> wrote:

>
> On 18/07/18 14:26, Thomas König wrote:
>
>> Hi Kyrill,
>>
>> Am 18.07.2018 um 13:17 schrieb Kyrill Tkachov <
>>> kyrylo.tkachov@foss.arm.com>:
>>>
>>> Thomas, Janne, would this relaxation of NaN handling be acceptable given
>>> the benefits
>>> mentioned above? If so, what would be the recommended adjustment to the
>>> nan_1.f90 test?
>>>
>> I would be a bit careful about changing behavior in such a major way.
>> What would the results with NaN and infinity then be, with or without
>> optimization? Would the results be consistent with min(nan,num) vs
>> min(num,nan)? Would they be consistent with the new IEEE standard?
>>
>> In general, I think that min(nan,num) should be nan and that our current
>> behavior is not the best.
>>
>> Does anybody have data points on how this is handled by other compilers?
>>
>> Oh, and if anything is changed, then compile and runtime behavior should
>> always be the same.
>>
>
> Thanks, that makes it clearer what behaviour is acceptable.
>
> So this v3 patch follows Richard Sandiford's suggested approach of
> emitting IFN_FMIN/FMAX
> when dealing with floating-point values and NaN handling is important and
> the target
> supports the IFN_FMIN/FMAX. Otherwise the current explicit comparison
> sequence is emitted.
> For integer types and -ffast-math floating-point it will emit MIN/MAX_EXPR.
>
> With this patch the nan_1.f90 behaviour is preserved on all targets, we
> get the optimal
> sequence on aarch64 and on x86_64 we avoid the function call, with no
> changes in code generation.
>
> This gives the performance improvement on 521.wrf on aarch64 and leaves it
> unchanged on x86_64.
>
> I'm hoping this addresses all the concerns raised in this thread:
> * The NaN-handling behaviour is unchanged on all platforms.
> * The fast inline sequence is emitted where it is available.
> * No calls to library fmin*/fmax* are emitted where there were none.
> * MIN/MAX_EXPR sequences are emitted where possible.
>
> Is this acceptable?
>

So if I understand it correctly, the "internal fn" thing is a mechanism
that lets us check whether the target supports expanding a builtin inline
or whether it requires a call to an external library function?

If so, then yes, Ok, thanks for the patch!

Janne Blomqvist July 18, 2018, 3:10 p.m. UTC | #4

On Wed, Jul 18, 2018 at 4:26 PM, Thomas König <tk@tkoenig.net> wrote:

> Hi Kyrill,
>
> > Am 18.07.2018 um 13:17 schrieb Kyrill Tkachov <
> kyrylo.tkachov@foss.arm.com>:
> >
> > Thomas, Janne, would this relaxation of NaN handling be acceptable given
> the benefits
> > mentioned above? If so, what would be the recommended adjustment to the
> nan_1.f90 test?
>
> I would be a bit careful about changing behavior in such a major way. What
> would the results with NaN and infinity then be, with or without
> optimization? Would the results be consistent with min(nan,num) vs
> min(num,nan)? Would they be consistent with the new IEEE standard?
>

AFAIU, MIN/MAX_EXPR do the right thing when comparing a normal number with
Inf. For NaN the result is undefined, and you might indeed have

min(a, NaN) = a
min(NaN, a) = NaN

where "a" is a normal number.

(I think that happens at least on x86 if MIN_EXPR is expanded to
minsd/minpd.)
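
To make the operand-order dependence concrete, here is a minimal C sketch of a minimum built from a single comparison. `min_one_compare` is a hypothetical helper, not something GCC emits; it only models an operation that falls back to one fixed operand whenever the comparison is false, which is always the case when a NaN is involved.

```c
#include <math.h>

/* Hypothetical single-comparison minimum: returns y only when y < x,
   otherwise x.  Any comparison with a NaN is false, so the result
   depends on which operand the NaN is in.  */
static double min_one_compare(double x, double y)
{
    return (y < x) ? y : x;
}
```

With this definition `min_one_compare(a, NAN)` gives `a` while `min_one_compare(NAN, a)` gives NaN, matching the two lines above. Infinities need no special casing, since ordinary comparisons order them correctly.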

Apparently what the proper result for min(a, NaN) should be is contentious
enough that minnum was removed from the upcoming IEEE 754 revision, and new
operations AFAICS have the semantics

minimum(a, NaN) = minimum(NaN, a) = NaN
minimumNumber(a, NaN) = minimumNumber(NaN, a) = a

That is, minimumNumber corresponds to minnum in IEEE 754-2008 and fmin* in
C, and to the current behavior of gfortran.
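
The two semantics can be contrasted with a small C sketch. The names `minimum_like` and `minimum_number_like` are hypothetical and chosen only for this illustration; the sketch assumes quiet NaNs and deliberately ignores floating-point exceptions and the ordering of signed zeros.

```c
#include <math.h>

/* NaN-propagating: if either operand is NaN, the result is NaN
   (the behaviour described above for "minimum").  */
static double minimum_like(double a, double b)
{
    if (isnan(a)) return a;
    if (isnan(b)) return b;
    return (b < a) ? b : a;
}

/* NaN-ignoring: if exactly one operand is NaN, return the other
   (the behaviour described above for "minimumNumber").  */
static double minimum_number_like(double a, double b)
{
    if (isnan(a)) return b;
    if (isnan(b)) return a;
    return (b < a) ? b : a;
}
```

Both functions are symmetric in their arguments, which is the property the single-comparison form lacks.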

> In general, I think that min(nan,num) should be nan and that our current
> behavior is not the best.
>

There was some extensive discussion of that in the Julia bug report I
linked to in an earlier message, and they came to the same conclusion and
changed their behavior.

> Does anybody have data points on how this is handled by other compilers?
>

The only other compiler I have access to at the moment is ifort (and not
the latest version), but maybe somebody has access to a wider variety?

> Oh, and if anything is changed, then compile and runtime behavior should
> always be the same.
>

Well, only if we place some weight on the runtime behavior being particularly
sensible wrt NaNs, which it wouldn't be if we just used a plain
MIN/MAX_EXPR. Is it worth taking a performance hit for, though? In
particular, if other compilers are inconsistent, we might as well do
whatever is fastest.

Richard Sandiford July 18, 2018, 3:27 p.m. UTC | #5

Thanks for doing this.

Kyrill  Tkachov <kyrylo.tkachov@foss.arm.com> writes:
> +	  calc = build_call_expr_internal_loc (input_location, ifn, type,
> +				      2, mvar, convert (type, val));

(indentation looks off)

> diff --git a/gcc/testsuite/gfortran.dg/max_fmaxl_aarch64.f90 b/gcc/testsuite/gfortran.dg/max_fmaxl_aarch64.f90
> new file mode 100644
> index 0000000000000000000000000000000000000000..8c8ea063e5d0718dc829c1f5574c5b46040e6786
> --- /dev/null
> +++ b/gcc/testsuite/gfortran.dg/max_fmaxl_aarch64.f90
> @@ -0,0 +1,9 @@
> +! { dg-do compile { target aarch64*-*-* } }
> +! { dg-options "-O2 -fdump-tree-optimized" }
> +
> +subroutine fool (a, b, c, d, e, f, g, h)
> +  real (kind=16) :: a, b, c, d, e, f, g, h
> +  a = max (a, b, c, d, e, f, g, h)
> +end subroutine
> +
> +! { dg-final { scan-tree-dump-times "__builtin_fmaxl " 7 "optimized" } }
> diff --git a/gcc/testsuite/gfortran.dg/min_fminl_aarch64.f90 b/gcc/testsuite/gfortran.dg/min_fminl_aarch64.f90
> new file mode 100644
> index 0000000000000000000000000000000000000000..92368917fb48e0c468a16d080ab3a9ac842e01a7
> --- /dev/null
> +++ b/gcc/testsuite/gfortran.dg/min_fminl_aarch64.f90
> @@ -0,0 +1,9 @@
> +! { dg-do compile { target aarch64*-*-* } }
> +! { dg-options "-O2 -fdump-tree-optimized" }
> +
> +subroutine fool (a, b, c, d, e, f, g, h)
> +  real (kind=16) :: a, b, c, d, e, f, g, h
> +  a = min (a, b, c, d, e, f, g, h)
> +end subroutine
> +
> +! { dg-final { scan-tree-dump-times "__builtin_fminl " 7 "optimized" } }

Do these still pass?  I wouldn't have expected us to use __builtin_fmin*
and __builtin_fmax* now.

It would be good to have tests that we use ".FMIN" and ".FMAX" for kind=4
and kind=8 on AArch64, since that's really the end goal here.

Thanks,
Richard

Kyrill Tkachov July 18, 2018, 4:04 p.m. UTC | #6

Hi Richard,

On 18/07/18 16:27, Richard Sandiford wrote:
> Thanks for doing this.
>
> Kyrill  Tkachov <kyrylo.tkachov@foss.arm.com> writes:
>> +	  calc = build_call_expr_internal_loc (input_location, ifn, type,
>> +				      2, mvar, convert (type, val));
> (indentation looks off)
>
>> diff --git a/gcc/testsuite/gfortran.dg/max_fmaxl_aarch64.f90 b/gcc/testsuite/gfortran.dg/max_fmaxl_aarch64.f90
>> new file mode 100644
>> index 0000000000000000000000000000000000000000..8c8ea063e5d0718dc829c1f5574c5b46040e6786
>> --- /dev/null
>> +++ b/gcc/testsuite/gfortran.dg/max_fmaxl_aarch64.f90
>> @@ -0,0 +1,9 @@
>> +! { dg-do compile { target aarch64*-*-* } }
>> +! { dg-options "-O2 -fdump-tree-optimized" }
>> +
>> +subroutine fool (a, b, c, d, e, f, g, h)
>> +  real (kind=16) :: a, b, c, d, e, f, g, h
>> +  a = max (a, b, c, d, e, f, g, h)
>> +end subroutine
>> +
>> +! { dg-final { scan-tree-dump-times "__builtin_fmaxl " 7 "optimized" } }
>> diff --git a/gcc/testsuite/gfortran.dg/min_fminl_aarch64.f90 b/gcc/testsuite/gfortran.dg/min_fminl_aarch64.f90
>> new file mode 100644
>> index 0000000000000000000000000000000000000000..92368917fb48e0c468a16d080ab3a9ac842e01a7
>> --- /dev/null
>> +++ b/gcc/testsuite/gfortran.dg/min_fminl_aarch64.f90
>> @@ -0,0 +1,9 @@
>> +! { dg-do compile { target aarch64*-*-* } }
>> +! { dg-options "-O2 -fdump-tree-optimized" }
>> +
>> +subroutine fool (a, b, c, d, e, f, g, h)
>> +  real (kind=16) :: a, b, c, d, e, f, g, h
>> +  a = min (a, b, c, d, e, f, g, h)
>> +end subroutine
>> +
>> +! { dg-final { scan-tree-dump-times "__builtin_fminl " 7 "optimized" } }
> Do these still pass?  I wouldn't have expected us to use __builtin_fmin*
> and __builtin_fmax* now.
>
> It would be good to have tests that we use ".FMIN" and ".FMAX" for kind=4
> and kind=8 on AArch64, since that's really the end goal here.

Doh, yes. I had spotted that myself after I had sent out the patch.
I've fixed that and the indentation issue in this small revision.

Given Janne's comments I will commit this tomorrow if there are no objections.
This patch should be a conservative improvement. If the Fortran folks decide
to sacrifice the more predictable NaN handling in favour of more optimisation
leeway by using MIN/MAX_EXPR unconditionally we can do that as a follow-up.

Thanks for the help,
Kyrill

2018-07-18  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>

     * trans-intrinsic.c (gfc_conv_intrinsic_minmax): Emit MIN/MAX_EXPR
     or IFN_FMIN/FMAX sequence to calculate the min/max when possible.

2018-07-18  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>

     * gfortran.dg/max_fmax_aarch64.f90: New test.
     * gfortran.dg/min_fmin_aarch64.f90: Likewise.
     * gfortran.dg/minmax_integer.f90: Likewise.
diff --git a/gcc/fortran/trans-intrinsic.c b/gcc/fortran/trans-intrinsic.c
index d306e3a5a6209c1621d91f99ffc366acecd9c3d0..c9b5479740c3f98f906132fda5c252274c4b6edd 100644
--- a/gcc/fortran/trans-intrinsic.c
+++ b/gcc/fortran/trans-intrinsic.c
@@ -31,6 +31,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "trans.h"
 #include "stringpool.h"
 #include "fold-const.h"
+#include "internal-fn.h"
 #include "tree-nested.h"
 #include "stor-layout.h"
 #include "toplev.h"	/* For rest_of_decl_compilation.  */
@@ -3874,14 +3875,15 @@ gfc_conv_intrinsic_ttynam (gfc_se * se, gfc_expr * expr)
     minmax (a1, a2, a3, ...)
     {
       mvar = a1;
-      if (a2 .op. mvar || isnan (mvar))
-        mvar = a2;
-      if (a3 .op. mvar || isnan (mvar))
-        mvar = a3;
+      mvar = COMP (mvar, a2)
+      mvar = COMP (mvar, a3)
       ...
-      return mvar
+      return mvar;
     }
- */
+    Where COMP is MIN/MAX_EXPR for integral types or when we don't
+    care about NaNs, or IFN_FMIN/MAX when the target has support for
+    fast NaN-honouring min/max.  When neither holds expand a sequence
+    of explicit comparisons.  */
 
 /* TODO: Mismatching types can occur when specific names are used.
    These should be handled during resolution.  */
@@ -3891,7 +3893,6 @@ gfc_conv_intrinsic_minmax (gfc_se * se, gfc_expr * expr, enum tree_code op)
   tree tmp;
   tree mvar;
   tree val;
-  tree thencase;
   tree *args;
   tree type;
   gfc_actual_arglist *argexpr;
@@ -3912,55 +3913,77 @@ gfc_conv_intrinsic_minmax (gfc_se * se, gfc_expr * expr, enum tree_code op)
 
   mvar = gfc_create_var (type, "M");
   gfc_add_modify (&se->pre, mvar, args[0]);
-  for (i = 1, argexpr = argexpr->next; i < nargs; i++)
-    {
-      tree cond, isnan;
 
+  internal_fn ifn = op == GT_EXPR ? IFN_FMAX : IFN_FMIN;
+
+  for (i = 1, argexpr = argexpr->next; i < nargs; i++, argexpr = argexpr->next)
+    {
+      tree cond = NULL_TREE;
       val = args[i];
 
       /* Handle absent optional arguments by ignoring the comparison.  */
       if (argexpr->expr->expr_type == EXPR_VARIABLE
 	  && argexpr->expr->symtree->n.sym->attr.optional
 	  && TREE_CODE (val) == INDIRECT_REF)
-	cond = fold_build2_loc (input_location,
+	{
+	  cond = fold_build2_loc (input_location,
 				NE_EXPR, logical_type_node,
 				TREE_OPERAND (val, 0),
 			build_int_cst (TREE_TYPE (TREE_OPERAND (val, 0)), 0));
-      else
-      {
-	cond = NULL_TREE;
-
+	}
+      else if (!VAR_P (val) && !TREE_CONSTANT (val))
 	/* Only evaluate the argument once.  */
-	if (!VAR_P (val) && !TREE_CONSTANT (val))
-	  val = gfc_evaluate_now (val, &se->pre);
-      }
+	val = gfc_evaluate_now (val, &se->pre);
 
-      thencase = build2_v (MODIFY_EXPR, mvar, convert (type, val));
+      tree calc;
+      /* If we are dealing with integral types or we don't care about NaNs
+	 just do a MIN/MAX_EXPR.  */
+      if (!HONOR_NANS (type) && !HONOR_SIGNED_ZEROS (type))
+	{
+
+	  tree_code code = op == GT_EXPR ? MAX_EXPR : MIN_EXPR;
+	  calc = fold_build2_loc (input_location, code, type,
+				  convert (type, val), mvar);
+	  tmp = build2_v (MODIFY_EXPR, mvar, calc);
 
-      tmp = fold_build2_loc (input_location, op, logical_type_node,
-			     convert (type, val), mvar);
+	}
+      /* If we care about NaNs and we have internal functions available for
+	 fmin/fmax to perform the comparison, use those.  */
+      else if (SCALAR_FLOAT_TYPE_P (type)
+	      && direct_internal_fn_supported_p (ifn, type, OPTIMIZE_FOR_SPEED))
+	{
+	  calc = build_call_expr_internal_loc (input_location, ifn, type,
+						2, mvar, convert (type, val));
+	  tmp = build2_v (MODIFY_EXPR, mvar, calc);
 
-      /* FIXME: When the IEEE_ARITHMETIC module is implemented, the call to
-	 __builtin_isnan might be made dependent on that module being loaded,
-	 to help performance of programs that don't rely on IEEE semantics.  */
-      if (FLOAT_TYPE_P (TREE_TYPE (mvar)))
+	}
+      /* Otherwise expand to:
+	mvar = a1;
+	if (a2 .op. mvar || isnan (mvar))
+	  mvar = a2;
+	if (a3 .op. mvar || isnan (mvar))
+	  mvar = a3;
+	...  */
+      else
 	{
-	  isnan = build_call_expr_loc (input_location,
-				       builtin_decl_explicit (BUILT_IN_ISNAN),
-				       1, mvar);
+	  tree isnan = build_call_expr_loc (input_location,
+					builtin_decl_explicit (BUILT_IN_ISNAN),
+					1, mvar);
+	  tmp = fold_build2_loc (input_location, op, logical_type_node,
+				 convert (type, val), mvar);
+
 	  tmp = fold_build2_loc (input_location, TRUTH_OR_EXPR,
-				 logical_type_node, tmp,
-				 fold_convert (logical_type_node, isnan));
+				  logical_type_node, tmp,
+				  fold_convert (logical_type_node, isnan));
+	  tmp = build3_v (COND_EXPR, tmp,
+			  build2_v (MODIFY_EXPR, mvar, convert (type, val)),
+			  build_empty_stmt (input_location));
 	}
-      tmp = build3_v (COND_EXPR, tmp, thencase,
-		      build_empty_stmt (input_location));
 
       if (cond != NULL_TREE)
 	tmp = build3_v (COND_EXPR, cond, tmp,
 			build_empty_stmt (input_location));
-
       gfc_add_expr_to_block (&se->pre, tmp);
-      argexpr = argexpr->next;
     }
   se->expr = mvar;
 }
diff --git a/gcc/testsuite/gfortran.dg/max_fmax_aarch64.f90 b/gcc/testsuite/gfortran.dg/max_fmax_aarch64.f90
new file mode 100644
index 0000000000000000000000000000000000000000..b818241a1f9aa7018efaf300cfecb70f413b7573
--- /dev/null
+++ b/gcc/testsuite/gfortran.dg/max_fmax_aarch64.f90
@@ -0,0 +1,15 @@
+! { dg-do compile { target aarch64*-*-* } }
+! { dg-options "-O2 -fdump-tree-optimized" }
+
+subroutine foo (a, b, c, d, e, f, g, h)
+  real (kind=8) :: a, b, c, d, e, f, g, h
+  a = max (a, b, c, d, e, f, g, h)
+end subroutine
+
+subroutine foof (a, b, c, d, e, f, g, h)
+  real (kind=4) :: a, b, c, d, e, f, g, h
+  a = max (a, b, c, d, e, f, g, h)
+end subroutine
+
+
+! { dg-final { scan-tree-dump-times "\.FMAX " 14 "optimized" } }
diff --git a/gcc/testsuite/gfortran.dg/min_fmin_aarch64.f90 b/gcc/testsuite/gfortran.dg/min_fmin_aarch64.f90
new file mode 100644
index 0000000000000000000000000000000000000000..009869b497df7737089971e00c01e1c29c0a3032
--- /dev/null
+++ b/gcc/testsuite/gfortran.dg/min_fmin_aarch64.f90
@@ -0,0 +1,15 @@
+! { dg-do compile { target aarch64*-*-* } }
+! { dg-options "-O2 -fdump-tree-optimized" }
+
+subroutine foo (a, b, c, d, e, f, g, h)
+  real (kind=8) :: a, b, c, d, e, f, g, h
+  a = min (a, b, c, d, e, f, g, h)
+end subroutine
+
+
+subroutine foof (a, b, c, d, e, f, g, h)
+  real (kind=4) :: a, b, c, d, e, f, g, h
+  a = min (a, b, c, d, e, f, g, h)
+end subroutine
+
+! { dg-final { scan-tree-dump-times "\.FMIN " 14 "optimized" } }
diff --git a/gcc/testsuite/gfortran.dg/minmax_integer.f90 b/gcc/testsuite/gfortran.dg/minmax_integer.f90
new file mode 100644
index 0000000000000000000000000000000000000000..5b6be38c7055ce4e8620cf75ec7d8a182436b24f
--- /dev/null
+++ b/gcc/testsuite/gfortran.dg/minmax_integer.f90
@@ -0,0 +1,15 @@
+! { dg-do compile }
+! { dg-options "-O2 -fdump-tree-optimized" }
+
+subroutine foo (a, b, c, d, e, f, g, h)
+  integer (kind=4) :: a, b, c, d, e, f, g, h
+  a = min (a, b, c, d, e, f, g, h)
+end subroutine
+
+subroutine foof (a, b, c, d, e, f, g, h)
+  integer (kind=4) :: a, b, c, d, e, f, g, h
+  a = max (a, b, c, d, e, f, g, h)
+end subroutine
+
+! { dg-final { scan-tree-dump-times "MIN_EXPR" 7 "optimized" } }
+! { dg-final { scan-tree-dump-times "MAX_EXPR" 7 "optimized" } }

Joseph Myers July 26, 2018, 8:35 p.m. UTC | #7

On Wed, 18 Jul 2018, Janne Blomqvist wrote:

> minimumNumber(a, NaN) = minimumNumber(NaN, a) = a
> 
> That is minimumNumber corresponds to minnum in IEEE 754-2008 and fmin* in

No, it differs in the handling of signaling NaNs (with minimumNumber, if 
the NaN argument is signaling, it results in the "invalid" exception but 
the non-NaN argument is still returned, whereas with minNum, a quiet NaN 
was returned in that case).  A new fminimum_num function is proposed as a 
C binding to the new operation.

http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2273.pdf

(The new operations are also more strictly defined regarding zero 
arguments, to treat -0 as less than +0, which was unspecified for minNum 
and fmin.)
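
The signed-zero point can be illustrated with a short C sketch. `min_signed_zero` is a hypothetical helper written for this example; it treats -0 as less than +0 and ignores quiet NaNs, and it does not attempt to model signaling NaNs or the "invalid" exception.

```c
#include <math.h>

/* Hypothetical minimum that orders -0.0 before +0.0.
   A plain "b < a" comparison cannot do this, because
   -0.0 and +0.0 compare equal.  */
static double min_signed_zero(double a, double b)
{
    if (isnan(a)) return b;
    if (isnan(b)) return a;
    if (a == 0.0 && b == 0.0)
        return signbit(a) ? a : b;   /* prefer whichever zero is negative */
    return (b < a) ? b : a;
}
```

Here the result for the operands (+0, -0) is -0 regardless of their order, whereas a bare comparison would simply return whichever operand it falls back to.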

Janne Blomqvist Aug. 6, 2018, 12:04 p.m. UTC | #8

On Wed, Jul 18, 2018 at 6:10 PM, Janne Blomqvist <blomqvist.janne@gmail.com>
wrote:

> On Wed, Jul 18, 2018 at 4:26 PM, Thomas König <tk@tkoenig.net> wrote:
>
>> Hi Kyrill,
>>
>> > Am 18.07.2018 um 13:17 schrieb Kyrill Tkachov <
>> kyrylo.tkachov@foss.arm.com>:
>> >
>> > Thomas, Janne, would this relaxation of NaN handling be acceptable
>> given the benefits
>> > mentioned above? If so, what would be the recommended adjustment to the
>> nan_1.f90 test?
>>
>> I would be a bit careful about changing behavior in such a major way.
>> What would the results with NaN and infinity then be, with or without
>> optimization? Would the results be consistent with min(nan,num) vs
>> min(num,nan)? Would they be consistent with the new IEEE standard?
>>
>
> AFAIU, MIN/MAX_EXPR do the right thing when comparing a normal number with
> Inf. For NaN the result is undefined, and you might indeed have
>
> min(a, NaN) = a
> min(NaN, a) = NaN
>
> where "a" is a normal number.
>
> (I think that happens at least on x86 if MIN_EXPR is expanded to
> minsd/minpd.)
>
> Apparently what the proper result for min(a, NaN) should be is contentious
> enough that minnum was removed from the upcoming IEEE 754 revision, and new
> operations AFAICS have the semantics
>
> minimum(a, NaN) = minimum(NaN, a) = NaN
> minimumNumber(a, NaN) = minimumNumber(NaN, a) = a
>
> That is minimumNumber corresponds to minnum in IEEE 754-2008 and fmin* in
> C, and to the current behavior of gfortran.
>
>
>> In general, I think that min(nan,num) should be nan and that our current
>> behavior is not the best.
>>
>
> There was some extensive discussion of that in the Julia bug report I
> linked to in an earlier message, and they came to the same conclusion and
> changed their behavior.
>
>
>> Does anybody have data points on how this is handled by other compilers?
>>
>
> The only other compiler I have access to at the moment is ifort (and not
> the latest version), but maybe somebody has access to a wider variety?
>
>
>> Oh, and if anything is changed, then compile and runtime behavior should
>> always be the same.
>>
>
> Well, IFF we place some weight on the runtime behavior being particularly
> sensible wrt NaN's, which it wouldn't be if we just use a plain
> MIN/MAX_EXPR. Is it worth taking a performance hit for, though? In
> particular, if other compilers are inconsistent, we might as well do
> whatever is fastest.
>
>
> --
> Janne Blomqvist
>


The testcase below (the functions in a separate file to prevent
inter-procedural and constant propagation optimizations):

program main
  implicit none
  real :: a, b = 1., mymax, mydiv
  external mymax, mydiv
  a = mydiv(0., 0.)
  print *, 'Verify that the following value is a NaN: ', a
  print *, 'max(', a, ',', b, ') = ', mymax(a, b)
  print *, 'max(', b, ',', a, ') = ', mymax(b, a)

  a = mydiv(1., 0.)
  print *, 'Verify that the following is a Inf: ', a
  print *, 'max(', a, ',', b, ') = ', mymax(a, b)
  print *, 'max(', b, ',', a, ') = ', mymax(b, a)
end program main

real function mymax(a, b)
  implicit none
  real :: a, b
  mymax = max(a, b)
end function mymax

real function mydiv(a, b)
  implicit none
  real :: a, b
  mydiv = a/b
end function mydiv


With gfortran 6.2 (didn't bother to check other versions as it shouldn't
have changed lately) and Intel Fortran 17.0.1 I get the following:

% gfortran main.f90 my.f90 && ./a.out
 Verify that the following value is a NaN:               NaN
 max(              NaN ,   1.00000000     ) =    1.00000000
 max(   1.00000000     ,              NaN ) =    1.00000000
 Verify that the following is a Inf:          Infinity
 max(         Infinity ,   1.00000000     ) =          Infinity
 max(   1.00000000     ,         Infinity ) =          Infinity

% gfortran -ffast-math main.f90 my.f90 && ./a.out
 Verify that the following value is a NaN:               NaN
 max(              NaN ,   1.00000000     ) =               NaN
 max(   1.00000000     ,              NaN ) =    1.00000000
 Verify that the following is a Inf:          Infinity
 max(         Infinity ,   1.00000000     ) =          Infinity
 max(   1.00000000     ,         Infinity ) =          Infinity


% ifort main.f90 my.f90 && ./a.out
 Verify that the following value is a NaN:             NaN
 max(            NaN ,   1.000000     ) =    1.000000
 max(   1.000000     ,            NaN ) =             NaN
 Verify that the following is a Inf:        Infinity
 max(       Infinity ,   1.000000     ) =        Infinity
 max(   1.000000     ,       Infinity ) =        Infinity


% ifort -fp-model strict main.f90 my.f90 && ./a.out
 Verify that the following value is a NaN:             NaN
 max(            NaN ,   1.000000     ) =    1.000000
 max(   1.000000     ,            NaN ) =             NaN
 Verify that the following is a Inf:        Infinity
 max(       Infinity ,   1.000000     ) =        Infinity
 max(   1.000000     ,       Infinity ) =        Infinity


For brevity I have omitted tests with various -O[N] optimization levels,
which didn't affect the results on either gfortran or ifort.

This suggests that ifort does the equivalent of MAX_EXPR unconditionally.

Does anyone have access to other compilers? If so, what results do they give?
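
For reference, the two asymmetric patterns in the output above can be reproduced with a plain comparison-based max, depending only on which operand is returned when the comparison is false. This is a minimal C sketch under that assumption, not a claim about the code either compiler actually emits; the function names are hypothetical.

```c
#include <assert.h>
#include <math.h>

/* Any ordered comparison with a NaN operand is false, so this form
   returns the SECOND operand whenever a NaN is involved.
   Matches the ifort pattern above: max(NaN, 1) = 1, max(1, NaN) = NaN. */
static float max_second_on_unordered(float a, float b)
{
    return a > b ? a : b;
}

/* Same comparison with the operands swapped: returns the FIRST operand
   whenever a NaN is involved.  Matches the gfortran -ffast-math pattern
   above: max(NaN, 1) = NaN, max(1, NaN) = 1. */
static float max_first_on_unordered(float a, float b)
{
    return b > a ? b : a;
}

/* Self-check against the results reported in this message. */
void check_max_sketches(void)
{
    assert(max_second_on_unordered(NAN, 1.0f) == 1.0f);
    assert(isnan(max_second_on_unordered(1.0f, NAN)));
    assert(isnan(max_first_on_unordered(NAN, 1.0f)));
    assert(max_first_on_unordered(1.0f, NAN) == 1.0f);
    /* Infinity is ordered, so both forms agree regardless of position. */
    assert(max_second_on_unordered(INFINITY, 1.0f) == INFINITY);
    assert(max_first_on_unordered(1.0f, INFINITY) == INFINITY);
}
```

Both forms agree for every pair of ordered operands, which is why the Inf rows are identical across all four runs and only the NaN rows differ.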
