Message ID: 56EB2BDC.30209@samsung.com
State: New
Hi Evandro,

> For example, though this approximation improves the performance
> noticeably for DF on A57, for SF, not so much, if at all.

I'm still skeptical that you can ever get any gain on scalars. I bet the
only gain is on 4x vectorized floats.

So what I would like to see is this implemented in a more general way. We
should be able to choose whether to expand depending on the mode -
including whether it is vectorized. For example, enable it on V4SFmode and
maybe V2DFmode, but not on any scalars. Then we'd add new CPU tuning
settings for division, sqrt and rsqrt (rather than adding lots of extra
tune flags).

Note the md file should call a function in aarch64.c to decide whether to
expand or not (your division approximation patch makes the decision in the
md file, which does not seem a good idea).

Wilco
On 03/18/16 10:21, Wilco Dijkstra wrote:
> Hi Evandro,
>
>> For example, though this approximation improves the performance
>> noticeably for DF on A57, for SF, not so much, if at all.
>
> I'm still skeptical that you can ever get any gain on scalars. I bet
> the only gain is on 4x vectorized floats.

I created a simple test that loops around an inline asm version of the
Newton series using scalar insns and got these results on A57:

   1/sqrt(x):  18290898/s   Fast: 45896823/s
   1/sqrtf(x): 69618490/s   Fast: 61865874/s

> So what I would like to see is this implemented in a more general way.
> We should be able to choose whether to expand depending on the mode -
> including whether it is vectorized. For example, enable it on V4SFmode
> and maybe V2DFmode, but not on any scalars.
>
> Then we'd add new CPU tuning settings for division, sqrt and rsqrt
> (rather than adding lots of extra tune flags).

If I understood you correctly, would something like coarse tuning flags
along with target-specific cost or parameter tables be what you have in
mind?

> Note the md file should call a function in aarch64.c to decide whether
> to expand or not (your division approximation patch makes the decision
> in the md file, which does not seem a good idea).

I agree. Will modify it.

Thank you,
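[For reference, the scalar Newton-Raphson sequence being benchmarked above can be modelled in plain C along these lines. This is a hypothetical stand-in for the inline asm version, not code from the thread: the crude bit-level initial estimate plays the role of AArch64's FRSQRTE, and each refinement line the role of an FRSQRTS/FMUL pair.]

```c
#include <math.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical C model of an approximate 1/sqrt(x): a crude initial
   estimate derived from the exponent bits (standing in for FRSQRTE),
   refined by two Newton-Raphson steps (standing in for FRSQRTS/FMUL).  */
static float
rsqrt_approx (float x)
{
  uint32_t i;
  float e;

  memcpy (&i, &x, sizeof i);       /* reinterpret the float's bits */
  i = 0x5f3759dfu - (i >> 1);      /* cheap initial estimate */
  memcpy (&e, &i, sizeof e);

  e = e * (1.5f - 0.5f * x * e * e);   /* first refinement step */
  e = e * (1.5f - 0.5f * x * e * e);   /* second refinement step */
  return e;
}
```

[Two refinement steps bring the result to within roughly single-precision accuracy, which is why the series can beat a full-precision FSQRT+FDIV on some cores but not others.]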
Evandro Menezes <e.menezes@samsung.com> wrote:
> On 03/18/16 10:21, Wilco Dijkstra wrote:
>> Hi Evandro,
>>
>>> For example, though this approximation improves the performance
>>> noticeably for DF on A57, for SF, not so much, if at all.
>> I'm still skeptical that you can ever get any gain on scalars. I bet
>> the only gain is on 4x vectorized floats.
>
> I created a simple test that loops around an inline asm version of the
> Newton series using scalar insns and got these results on A57:

That's pure max throughput rather than answering the question whether it
speeds up code that does real work. A test that loads an array of vectors
and writes back the unit vectors would be a more realistic scenario.

Note our testing showed rsqrt slows down various benchmarks:
https://gcc.gnu.org/ml/gcc-patches/2016-01/msg00574.html

> If I understood you correctly, would something like coarse tuning flags
> along with target-specific cost or parameter tables be what you have in
> mind?

Yes, the magic tuning flags can be coarse (on/off is good enough). If we
can agree that these expansions are really only useful for 4x vectorized
code and not much else, then all we need is a function that enables it for
those modes. Otherwise we would need per-CPU settings that select which
expansions are enabled for which modes (not just single/double).

Wilco
On 03/18/16 17:20, Wilco Dijkstra wrote:
> Evandro Menezes <e.menezes@samsung.com> wrote:
>> On 03/18/16 10:21, Wilco Dijkstra wrote:
>>> Hi Evandro,
>>>
>>>> For example, though this approximation improves the performance
>>>> noticeably for DF on A57, for SF, not so much, if at all.
>>> I'm still skeptical that you can ever get any gain on scalars. I bet
>>> the only gain is on 4x vectorized floats.
>> I created a simple test that loops around an inline asm version of the
>> Newton series using scalar insns and got these results on A57:
> That's pure max throughput rather than answering the question whether
> it speeds up code that does real work. A test that loads an array of
> vectors and writes back the unit vectors would be a more realistic
> scenario.
>
> Note our testing showed rsqrt slows down various benchmarks:
> https://gcc.gnu.org/ml/gcc-patches/2016-01/msg00574.html

I remember having seen that, but my point is that if A57 enabled this for
only DF, it might be an overall improvement.

>> If I understood you correctly, would something like coarse tuning
>> flags along with target-specific cost or parameter tables be what you
>> have in mind?
> Yes, the magic tuning flags can be coarse (on/off is good enough). If
> we can agree that these expansions are really only useful for 4x
> vectorized code and not much else, then all we need is a function that
> enables it for those modes. Otherwise we would need per-CPU settings
> that select which expansions are enabled for which modes (not just
> single/double).

Just to be clear, the flags refer to the inner mode, whether scalar or
vector. I'm not hopeful that it can be said that this is only useful when
vectorized though.
On 03/18/16 18:00, Evandro Menezes wrote:
> On 03/18/16 17:20, Wilco Dijkstra wrote:
>> Evandro Menezes <e.menezes@samsung.com> wrote:
>>> On 03/18/16 10:21, Wilco Dijkstra wrote:
>>>> Hi Evandro,
>>>>
>>>>> For example, though this approximation improves the performance
>>>>> noticeably for DF on A57, for SF, not so much, if at all.
>>>> I'm still skeptical that you can ever get any gain on scalars. I
>>>> bet the only gain is on 4x vectorized floats.
>>> I created a simple test that loops around an inline asm version of
>>> the Newton series using scalar insns and got these results on A57:
>> That's pure max throughput rather than answering the question whether
>> it speeds up code that does real work. A test that loads an array of
>> vectors and writes back the unit vectors would be a more realistic
>> scenario.
>>
>> Note our testing showed rsqrt slows down various benchmarks:
>> https://gcc.gnu.org/ml/gcc-patches/2016-01/msg00574.html
>
> I remember having seen that, but my point is that if A57 enabled this
> for only DF, it might be an overall improvement.
>
>>> If I understood you correctly, would something like coarse tuning
>>> flags along with target-specific cost or parameter tables be what
>>> you have in mind?
>> Yes, the magic tuning flags can be coarse (on/off is good enough). If
>> we can agree that these expansions are really only useful for 4x
>> vectorized code and not much else, then all we need is a function
>> that enables it for those modes. Otherwise we would need per-CPU
>> settings that select which expansions are enabled for which modes
>> (not just single/double).
>
> Just to be clear, the flags refer to the inner mode, whether scalar or
> vector. I'm not hopeful that it can be said that this is only useful
> when vectorized though.

Ping^1
Evandro Menezes wrote:
>
> Ping^1

I haven't seen a newer version that incorporates my feedback. To recap,
what I'd like to see is a more general way to select approximations based
on mode. I don't believe that looking at the inner mode works in general,
and it doesn't make sense to add internal tune flags for all possible
combinations.

To give an idea of what I mean, it would be easiest to add a single field
to the CPU tuning structure that contains a mask for all the combinations.
Then we call a single function with the approximation kind, i.e. sqrt,
rsqrt, div (x/y), recip (1/x), and the mode, which uses the CPU tuning
field to decide whether it should be inlined.

Cheers,
Wilco
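[The interface Wilco describes above could be sketched like this. Every identifier here is hypothetical, chosen only to illustrate the shape of the proposal - one per-CPU mask covering all (kind, mode) combinations, queried through a single function - and is not actual GCC code.]

```c
#include <stdbool.h>

/* Hypothetical sketch: one bit per (approximation kind, mode) pair.  */
enum approx_kind { KIND_SQRT, KIND_RSQRT, KIND_DIV, KIND_RECIP };
enum fp_mode     { MODE_SF, MODE_DF, MODE_V2SF, MODE_V4SF, MODE_V2DF,
                   MODE_COUNT };

#define APPROX_BIT(kind, mode)  (1u << ((kind) * MODE_COUNT + (mode)))

/* Example per-CPU tuning field: a core that wants the rsqrt expansion
   only for 4x floats and 2x doubles.  */
static const unsigned example_tune_mask
  = APPROX_BIT (KIND_RSQRT, MODE_V4SF) | APPROX_BIT (KIND_RSQRT, MODE_V2DF);

/* The single query function: should this (kind, mode) be expanded
   inline for the current CPU?  */
static bool
use_approx_p (unsigned tune_mask, enum approx_kind kind, enum fp_mode mode)
{
  return (tune_mask & APPROX_BIT (kind, mode)) != 0;
}
```

[A single mask field scales to the full cartesian product without adding one named tune flag per combination, which is the objection James raises in his reply.]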
On Fri, Apr 01, 2016 at 02:47:05PM +0100, Wilco Dijkstra wrote:
> Evandro Menezes wrote:
>>
>> Ping^1
>
> I haven't seen a newer version that incorporates my feedback. To recap,
> what I'd like to see is a more general way to select approximations
> based on mode. I don't believe that looking at the inner mode works in
> general, and it doesn't make sense to add internal tune flags for all
> possible combinations.

Agreed. I don't think that a flag for each of the cartesian product of
{rsqrt,sqrt,div} X {SF,DF,V2SF,V4SF,V2DF} is a scalable solution - that's
at least 15 flags we'll need.

As I said earlier in the discussion, this particular split (between SF and
DF mode) seems strange to me. I'd expect that V4SF vs. SF would also be
interesting, and that a distinction between vector modes and scalar modes
would be more likely to be useful.

> To give an idea of what I mean, it would be easiest to add a single
> field to the CPU tuning structure that contains a mask for all the
> combinations. Then we call a single function with the approximation
> kind, i.e. sqrt, rsqrt, div (x/y), recip (1/x), and the mode, which
> uses the CPU tuning field to decide whether it should be inlined.

I like the idea of a single cost function.

These patches are well and truly on my radar for GCC 7, but as we're still
in bugfixing mode (and there's still plenty to do!), I'm not going to get
round to giving them a more detailed review until after the release. Feel
free to ping them again once GCC 6 has shipped.

Thanks,
James
On 04/01/16 08:47, Wilco Dijkstra wrote:
> Evandro Menezes wrote:
>> Ping^1
> I haven't seen a newer version that incorporates my feedback. To recap,
> what I'd like to see is a more general way to select approximations
> based on mode. I don't believe that looking at the inner mode works in
> general, and it doesn't make sense to add internal tune flags for all
> possible combinations.
>
> To give an idea of what I mean, it would be easiest to add a single
> field to the CPU tuning structure that contains a mask for all the
> combinations. Then we call a single function with the approximation
> kind, i.e. sqrt, rsqrt, div (x/y), recip (1/x), and the mode, which
> uses the CPU tuning field to decide whether it should be inlined.

I think that I have a better idea of what you meant.

Thank you,
On 04/01/16 09:06, James Greenhalgh wrote:
> On Fri, Apr 01, 2016 at 02:47:05PM +0100, Wilco Dijkstra wrote:
>> Evandro Menezes wrote:
>>> Ping^1
>> I haven't seen a newer version that incorporates my feedback. To
>> recap, what I'd like to see is a more general way to select
>> approximations based on mode. I don't believe that looking at the
>> inner mode works in general, and it doesn't make sense to add
>> internal tune flags for all possible combinations.
> Agreed. I don't think that a flag for each of the cartesian product of
> {rsqrt,sqrt,div} X {SF,DF,V2SF,V4SF,V2DF} is a scalable solution -
> that's at least 15 flags we'll need.
>
> As I said earlier in the discussion, this particular split (between SF
> and DF mode) seems strange to me. I'd expect that V4SF vs. SF would
> also be interesting, and that a distinction between vector modes and
> scalar modes would be more likely to be useful.
>
>> To give an idea of what I mean, it would be easiest to add a single
>> field to the CPU tuning structure that contains a mask for all the
>> combinations. Then we call a single function with the approximation
>> kind, i.e. sqrt, rsqrt, div (x/y), recip (1/x), and the mode, which
>> uses the CPU tuning field to decide whether it should be inlined.
> I like the idea of a single cost function.

I'll go with it.

> These patches are well and truly on my radar for GCC 7, but as we're
> still in bugfixing mode (and there's still plenty to do!), I'm not
> going to get round to giving them a more detailed review until after
> the release. Feel free to ping them again once GCC 6 has shipped.

I've been proposing this change for a few months now and I'd really like
to have it in GCC 6. I'd appreciate it if you'd consider this request, all
things considered.

Thank you,
From 95581aefcf324233c3603f4d8232ee18c5836f8a Mon Sep 17 00:00:00 2001
From: Evandro Menezes <e.menezes@samsung.com>
Date: Thu, 17 Mar 2016 17:00:03 -0500
Subject: [PATCH] Add precision choices for the reciprocal square root
 approximation

Allow a target to prefer such operation depending on the FP precision.

gcc/
	* config/aarch64/aarch64-protos.h
	(AARCH64_EXTRA_TUNE_APPROX_RSQRT): New macro.
	* config/aarch64/aarch64-tuning-flags.def
	(AARCH64_EXTRA_TUNE_APPROX_RSQRT_DF): New mask.
	(AARCH64_EXTRA_TUNE_APPROX_RSQRT_SF): Likewise.
	* config/aarch64/aarch64.c (use_rsqrt_p): New argument for the mode.
	(aarch64_builtin_reciprocal): Devise mode from builtin.
	(aarch64_optab_supported_p): New argument for the mode.
---
 gcc/config/aarch64/aarch64-protos.h         |  3 +++
 gcc/config/aarch64/aarch64-tuning-flags.def |  3 ++-
 gcc/config/aarch64/aarch64.c                | 23 +++++++++++++++--------
 3 files changed, 20 insertions(+), 9 deletions(-)

diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
index dced209..58e5d73 100644
--- a/gcc/config/aarch64/aarch64-protos.h
+++ b/gcc/config/aarch64/aarch64-protos.h
@@ -263,6 +263,9 @@ enum aarch64_extra_tuning_flags
 };
 #undef AARCH64_EXTRA_TUNING_OPTION
 
+#define AARCH64_EXTRA_TUNE_APPROX_RSQRT \
+  (AARCH64_EXTRA_TUNE_APPROX_RSQRT_DF | AARCH64_EXTRA_TUNE_APPROX_RSQRT_SF)
+
 extern struct tune_params aarch64_tune_params;
 
 HOST_WIDE_INT aarch64_initial_elimination_offset (unsigned, unsigned);
diff --git a/gcc/config/aarch64/aarch64-tuning-flags.def b/gcc/config/aarch64/aarch64-tuning-flags.def
index 7e45a0c..57d9588 100644
--- a/gcc/config/aarch64/aarch64-tuning-flags.def
+++ b/gcc/config/aarch64/aarch64-tuning-flags.def
@@ -29,5 +29,6 @@ AARCH64_TUNE_ to give an enum name.  */
 
 AARCH64_EXTRA_TUNING_OPTION ("rename_fma_regs", RENAME_FMA_REGS)
-AARCH64_EXTRA_TUNING_OPTION ("approx_rsqrt", APPROX_RSQRT)
+AARCH64_EXTRA_TUNING_OPTION ("approx_rsqrt", APPROX_RSQRT_DF)
+AARCH64_EXTRA_TUNING_OPTION ("approx_rsqrtf", APPROX_RSQRT_SF)
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 12e498d..e651123 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -7440,12 +7440,16 @@ aarch64_memory_move_cost (machine_mode mode ATTRIBUTE_UNUSED,
    to optimize 1.0/sqrt.  */
 
 static bool
-use_rsqrt_p (void)
+use_rsqrt_p (machine_mode mode)
 {
   return (!flag_trapping_math
 	  && flag_unsafe_math_optimizations
-	  && ((aarch64_tune_params.extra_tuning_flags
-	       & AARCH64_EXTRA_TUNE_APPROX_RSQRT)
+	  && ((GET_MODE_INNER (mode) == SFmode
+	       && (aarch64_tune_params.extra_tuning_flags
+		   & AARCH64_EXTRA_TUNE_APPROX_RSQRT_SF))
+	      || (GET_MODE_INNER (mode) == DFmode
+		  && (aarch64_tune_params.extra_tuning_flags
+		      & AARCH64_EXTRA_TUNE_APPROX_RSQRT_DF))
 	      || flag_mrecip_low_precision_sqrt));
 }
 
@@ -7455,9 +7459,12 @@ use_rsqrt_p (void)
 static tree
 aarch64_builtin_reciprocal (tree fndecl)
 {
-  if (!use_rsqrt_p ())
-    return NULL_TREE;
-  return aarch64_builtin_rsqrt (DECL_FUNCTION_CODE (fndecl));
+  machine_mode mode = TYPE_MODE (TREE_TYPE (fndecl));
+
+  if (use_rsqrt_p (mode))
+    return aarch64_builtin_rsqrt (DECL_FUNCTION_CODE (fndecl));
+
+  return NULL_TREE;
 }
 
 typedef rtx (*rsqrte_type) (rtx, rtx);
@@ -13952,13 +13959,13 @@ aarch64_promoted_type (const_tree t)
 /* Implement the TARGET_OPTAB_SUPPORTED_P hook.  */
 
 static bool
-aarch64_optab_supported_p (int op, machine_mode, machine_mode,
+aarch64_optab_supported_p (int op, machine_mode mode1, machine_mode,
 			   optimization_type opt_type)
 {
   switch (op)
     {
     case rsqrt_optab:
-      return opt_type == OPTIMIZE_FOR_SPEED && use_rsqrt_p ();
+      return opt_type == OPTIMIZE_FOR_SPEED && use_rsqrt_p (mode1);
 
     default:
       return true;
-- 
1.9.1