diff mbox

[AArch64] Remove AARCH64_EXTRA_TUNE_RECIP_SQRT from Cortex-A57 tuning

Message ID 1452513883-25826-1-git-send-email-james.greenhalgh@arm.com
State New
Headers show

Commit Message

James Greenhalgh Jan. 11, 2016, 12:04 p.m. UTC
Hi,

I've seen a couple of large performance issues caused by expanding
the high-precision reciprocal square root for Cortex-A57, so I'd like
to turn it off by default.

This is good for art (~2%) from Spec2000, bad (~3.5%) for fma3d from
Spec2000, good (~5.5%) for gromcas from Spec2006, and very good (>10%) for
some private microbenchmark kernels which stress the divide/sqrt/multiply
units. It therefore seems to me to be the correct choice to make across
a number of workloads.

Bootstrapped and tested on aarch64-none-linux-gnu with no issues.

OK?

Thanks,
James

---
2015-12-11  James Greenhalgh  <james.greenhalgh@arm.com>

	* config/aarch64/aarch64.c (cortexa57_tunings): Remove
	AARCH64_EXTRA_TUNE_RECIP_SQRT.

Comments

Philipp Tomsich Jan. 11, 2016, 1:31 p.m. UTC | #1
James,

ok from our side—good to see that this also benefits the A57.

Best,
Philipp.

> On 11 Jan 2016, at 13:04, James Greenhalgh <james.greenhalgh@arm.com> wrote:
> 
> 
> Hi,
> 
> I've seen a couple of large performance issues caused by expanding
> the high-precision reciprocal square root for Cortex-A57, so I'd like
> to turn it off by default.
> 
> This is good for art (~2%) from Spec2000, bad (~3.5%) for fma3d from
> Spec2000, good (~5.5%) for gromcas from Spec2006, and very good (>10%) for
> some private microbenchmark kernels which stress the divide/sqrt/multiply
> units. It therefore seems to me to be the correct choice to make across
> a number of workloads.
> 
> Bootstrapped and tested on aarch64-none-linux-gnu with no issues.
> 
> OK?
> 
> Thanks,
> James
> 
> ---
> 2015-12-11  James Greenhalgh  <james.greenhalgh@arm.com>
> 
> 	* config/aarch64/aarch64.c (cortexa57_tunings): Remove
> 	AARCH64_EXTRA_TUNE_RECIP_SQRT.
> 
> <0001-AArch64-Remove-AARCH64_EXTRA_TUNE_RECIP_SQRT-from-Co.patch>
James Greenhalgh Jan. 25, 2016, 11:20 a.m. UTC | #2
On Mon, Jan 11, 2016 at 12:04:43PM +0000, James Greenhalgh wrote:
> 
> Hi,
> 
> I've seen a couple of large performance issues caused by expanding
> the high-precision reciprocal square root for Cortex-A57, so I'd like
> to turn it off by default.
> 
> This is good for art (~2%) from Spec2000, bad (~3.5%) for fma3d from
> Spec2000, good (~5.5%) for gromcas from Spec2006, and very good (>10%) for
> some private microbenchmark kernels which stress the divide/sqrt/multiply
> units. It therefore seems to me to be the correct choice to make across
> a number of workloads.
> 
> Bootstrapped and tested on aarch64-none-linux-gnu with no issues.
> 
> OK?

*Ping*

Thanks,
James

> ---
> 2015-12-11  James Greenhalgh  <james.greenhalgh@arm.com>
> 
> 	* config/aarch64/aarch64.c (cortexa57_tunings): Remove
> 	AARCH64_EXTRA_TUNE_RECIP_SQRT.
> 

> diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
> index 1d5d898..999c9fc 100644
> --- a/gcc/config/aarch64/aarch64.c
> +++ b/gcc/config/aarch64/aarch64.c
> @@ -484,8 +484,7 @@ static const struct tune_params cortexa57_tunings =
>    0,	/* max_case_values.  */
>    0,	/* cache_line_size.  */
>    tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
> -  (AARCH64_EXTRA_TUNE_RENAME_FMA_REGS
> -   | AARCH64_EXTRA_TUNE_RECIP_SQRT)	/* tune_flags.  */
> +  (AARCH64_EXTRA_TUNE_RENAME_FMA_REGS)	/* tune_flags.  */
>  };
>  
>  static const struct tune_params cortexa72_tunings =
James Greenhalgh Feb. 1, 2016, 2 p.m. UTC | #3
On Mon, Jan 25, 2016 at 11:20:46AM +0000, James Greenhalgh wrote:
> On Mon, Jan 11, 2016 at 12:04:43PM +0000, James Greenhalgh wrote:
> > 
> > Hi,
> > 
> > I've seen a couple of large performance issues caused by expanding
> > the high-precision reciprocal square root for Cortex-A57, so I'd like
> > to turn it off by default.
> > 
> > This is good for art (~2%) from Spec2000, bad (~3.5%) for fma3d from
> > Spec2000, good (~5.5%) for gromcas from Spec2006, and very good (>10%) for
> > some private microbenchmark kernels which stress the divide/sqrt/multiply
> > units. It therefore seems to me to be the correct choice to make across
> > a number of workloads.
> > 
> > Bootstrapped and tested on aarch64-none-linux-gnu with no issues.
> > 
> > OK?
> 
> *Ping*

*pingx2*

Thanks,
James

> > ---
> > 2015-12-11  James Greenhalgh  <james.greenhalgh@arm.com>
> > 
> > 	* config/aarch64/aarch64.c (cortexa57_tunings): Remove
> > 	AARCH64_EXTRA_TUNE_RECIP_SQRT.
> > 
> 
> > diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
> > index 1d5d898..999c9fc 100644
> > --- a/gcc/config/aarch64/aarch64.c
> > +++ b/gcc/config/aarch64/aarch64.c
> > @@ -484,8 +484,7 @@ static const struct tune_params cortexa57_tunings =
> >    0,	/* max_case_values.  */
> >    0,	/* cache_line_size.  */
> >    tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
> > -  (AARCH64_EXTRA_TUNE_RENAME_FMA_REGS
> > -   | AARCH64_EXTRA_TUNE_RECIP_SQRT)	/* tune_flags.  */
> > +  (AARCH64_EXTRA_TUNE_RENAME_FMA_REGS)	/* tune_flags.  */
> >  };
> >  
> >  static const struct tune_params cortexa72_tunings =
>
James Greenhalgh Feb. 8, 2016, 10:57 a.m. UTC | #4
On Mon, Feb 01, 2016 at 02:00:01PM +0000, James Greenhalgh wrote:
> On Mon, Jan 25, 2016 at 11:20:46AM +0000, James Greenhalgh wrote:
> > On Mon, Jan 11, 2016 at 12:04:43PM +0000, James Greenhalgh wrote:
> > > 
> > > Hi,
> > > 
> > > I've seen a couple of large performance issues caused by expanding
> > > the high-precision reciprocal square root for Cortex-A57, so I'd like
> > > to turn it off by default.
> > > 
> > > This is good for art (~2%) from Spec2000, bad (~3.5%) for fma3d from
> > > Spec2000, good (~5.5%) for gromcas from Spec2006, and very good (>10%) for
> > > some private microbenchmark kernels which stress the divide/sqrt/multiply
> > > units. It therefore seems to me to be the correct choice to make across
> > > a number of workloads.
> > > 
> > > Bootstrapped and tested on aarch64-none-linux-gnu with no issues.
> > > 
> > > OK?
> > 
> > *Ping*
> 
> *pingx2*

*ping^3*

Thanks,
James

> > > ---
> > > 2015-12-11  James Greenhalgh  <james.greenhalgh@arm.com>
> > > 
> > > 	* config/aarch64/aarch64.c (cortexa57_tunings): Remove
> > > 	AARCH64_EXTRA_TUNE_RECIP_SQRT.
> > > 
> > 
> > > diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
> > > index 1d5d898..999c9fc 100644
> > > --- a/gcc/config/aarch64/aarch64.c
> > > +++ b/gcc/config/aarch64/aarch64.c
> > > @@ -484,8 +484,7 @@ static const struct tune_params cortexa57_tunings =
> > >    0,	/* max_case_values.  */
> > >    0,	/* cache_line_size.  */
> > >    tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
> > > -  (AARCH64_EXTRA_TUNE_RENAME_FMA_REGS
> > > -   | AARCH64_EXTRA_TUNE_RECIP_SQRT)	/* tune_flags.  */
> > > +  (AARCH64_EXTRA_TUNE_RENAME_FMA_REGS)	/* tune_flags.  */
> > >  };
> > >  
> > >  static const struct tune_params cortexa72_tunings =
> > 
>
James Greenhalgh Feb. 15, 2016, 10:50 a.m. UTC | #5
On Mon, Feb 08, 2016 at 10:57:10AM +0000, James Greenhalgh wrote:
> On Mon, Feb 01, 2016 at 02:00:01PM +0000, James Greenhalgh wrote:
> > On Mon, Jan 25, 2016 at 11:20:46AM +0000, James Greenhalgh wrote:
> > > On Mon, Jan 11, 2016 at 12:04:43PM +0000, James Greenhalgh wrote:
> > > > 
> > > > Hi,
> > > > 
> > > > I've seen a couple of large performance issues caused by expanding
> > > > the high-precision reciprocal square root for Cortex-A57, so I'd like
> > > > to turn it off by default.
> > > > 
> > > > This is good for art (~2%) from Spec2000, bad (~3.5%) for fma3d from
> > > > Spec2000, good (~5.5%) for gromcas from Spec2006, and very good (>10%) for
> > > > some private microbenchmark kernels which stress the divide/sqrt/multiply
> > > > units. It therefore seems to me to be the correct choice to make across
> > > > a number of workloads.
> > > > 
> > > > Bootstrapped and tested on aarch64-none-linux-gnu with no issues.
> > > > 
> > > > OK?
> > > 
> > > *Ping*
> > 
> > *pingx2*
> 
> *ping^3*

*ping^4*

Thanks,
James

> > > > ---
> > > > 2015-12-11  James Greenhalgh  <james.greenhalgh@arm.com>
> > > > 
> > > > 	* config/aarch64/aarch64.c (cortexa57_tunings): Remove
> > > > 	AARCH64_EXTRA_TUNE_RECIP_SQRT.
> > > > 
> > > 
> > > > diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
> > > > index 1d5d898..999c9fc 100644
> > > > --- a/gcc/config/aarch64/aarch64.c
> > > > +++ b/gcc/config/aarch64/aarch64.c
> > > > @@ -484,8 +484,7 @@ static const struct tune_params cortexa57_tunings =
> > > >    0,	/* max_case_values.  */
> > > >    0,	/* cache_line_size.  */
> > > >    tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
> > > > -  (AARCH64_EXTRA_TUNE_RENAME_FMA_REGS
> > > > -   | AARCH64_EXTRA_TUNE_RECIP_SQRT)	/* tune_flags.  */
> > > > +  (AARCH64_EXTRA_TUNE_RENAME_FMA_REGS)	/* tune_flags.  */
> > > >  };
> > > >  
> > > >  static const struct tune_params cortexa72_tunings =
> > > 
> > 
>
Evandro Menezes Feb. 15, 2016, 5:24 p.m. UTC | #6
On 02/15/16 04:50, James Greenhalgh wrote:
> On Mon, Feb 08, 2016 at 10:57:10AM +0000, James Greenhalgh wrote:
>> On Mon, Feb 01, 2016 at 02:00:01PM +0000, James Greenhalgh wrote:
>>> On Mon, Jan 25, 2016 at 11:20:46AM +0000, James Greenhalgh wrote:
>>>> On Mon, Jan 11, 2016 at 12:04:43PM +0000, James Greenhalgh wrote:
>>>>> Hi,
>>>>>
>>>>> I've seen a couple of large performance issues caused by expanding
>>>>> the high-precision reciprocal square root for Cortex-A57, so I'd like
>>>>> to turn it off by default.
>>>>>
>>>>> This is good for art (~2%) from Spec2000, bad (~3.5%) for fma3d from
>>>>> Spec2000, good (~5.5%) for gromcas from Spec2006, and very good (>10%) for
>>>>> some private microbenchmark kernels which stress the divide/sqrt/multiply
>>>>> units. It therefore seems to me to be the correct choice to make across
>>>>> a number of workloads.
>>>>>
>>>>> Bootstrapped and tested on aarch64-none-linux-gnu with no issues.
>>>>>
>>>>> OK?
>>>> *Ping*
>>> *pingx2*
>> *ping^3*
> *ping^4*
>
> Thanks,
> James
>
>>>>> ---
>>>>> 2015-12-11  James Greenhalgh  <james.greenhalgh@arm.com>
>>>>>
>>>>> 	* config/aarch64/aarch64.c (cortexa57_tunings): Remove
>>>>> 	AARCH64_EXTRA_TUNE_RECIP_SQRT.
>>>>>
>>>>> diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
>>>>> index 1d5d898..999c9fc 100644
>>>>> --- a/gcc/config/aarch64/aarch64.c
>>>>> +++ b/gcc/config/aarch64/aarch64.c
>>>>> @@ -484,8 +484,7 @@ static const struct tune_params cortexa57_tunings =
>>>>>     0,	/* max_case_values.  */
>>>>>     0,	/* cache_line_size.  */
>>>>>     tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
>>>>> -  (AARCH64_EXTRA_TUNE_RENAME_FMA_REGS
>>>>> -   | AARCH64_EXTRA_TUNE_RECIP_SQRT)	/* tune_flags.  */
>>>>> +  (AARCH64_EXTRA_TUNE_RENAME_FMA_REGS)	/* tune_flags.  */
>>>>>   };
>>>>>   
>>>>>   static const struct tune_params cortexa72_tunings =
>

James,

There seem to be SPEC CPU2000fp validation issues on A57 when this flag 
is present too.  Though I evaluated the algorithm with a huge random set 
of values, always delivering accuracy around 1ulp, which should be 
enough for CPU2000fp (wit x86-64), I expected the benchmarks to pass.

My suspicion is that the Newton series on AArch64 is probably good only 
for SP.  Then, DP might require an extra round, probably exacerbating 
the performance penalty.

I'd like to try to split this tuning option into one for SP and another 
for DP.  Thoughts?

Thank you,
Marcus Shawcroft Feb. 16, 2016, 8:49 a.m. UTC | #7
On 11 January 2016 at 12:04, James Greenhalgh <james.greenhalgh@arm.com> wrote:

> 2015-12-11  James Greenhalgh  <james.greenhalgh@arm.com>
>
>         * config/aarch64/aarch64.c (cortexa57_tunings): Remove
>         AARCH64_EXTRA_TUNE_RECIP_SQRT.
>

OK /Marcus
James Greenhalgh Feb. 16, 2016, 10:28 a.m. UTC | #8
On Mon, Feb 15, 2016 at 11:24:53AM -0600, Evandro Menezes wrote:
> On 02/15/16 04:50, James Greenhalgh wrote:
> >On Mon, Feb 08, 2016 at 10:57:10AM +0000, James Greenhalgh wrote:
> >>On Mon, Feb 01, 2016 at 02:00:01PM +0000, James Greenhalgh wrote:
> >>>On Mon, Jan 25, 2016 at 11:20:46AM +0000, James Greenhalgh wrote:
> >>>>On Mon, Jan 11, 2016 at 12:04:43PM +0000, James Greenhalgh wrote:
> >>>>>Hi,
> >>>>>
> >>>>>I've seen a couple of large performance issues caused by expanding
> >>>>>the high-precision reciprocal square root for Cortex-A57, so I'd like
> >>>>>to turn it off by default.
> >>>>>
> >>>>>This is good for art (~2%) from Spec2000, bad (~3.5%) for fma3d from
> >>>>>Spec2000, good (~5.5%) for gromcas from Spec2006, and very good (>10%) for
> >>>>>some private microbenchmark kernels which stress the divide/sqrt/multiply
> >>>>>units. It therefore seems to me to be the correct choice to make across
> >>>>>a number of workloads.
> >>>>>
> >>>>>Bootstrapped and tested on aarch64-none-linux-gnu with no issues.
> >>>>>
> >>>>>OK?
> >>>>*Ping*
> >>>*pingx2*
> >>*ping^3*
> >*ping^4*
> >
> >Thanks,
> >James
> >
> >>>>>---
> >>>>>2015-12-11  James Greenhalgh  <james.greenhalgh@arm.com>
> >>>>>
> >>>>>	* config/aarch64/aarch64.c (cortexa57_tunings): Remove
> >>>>>	AARCH64_EXTRA_TUNE_RECIP_SQRT.
> >>>>>
> >>>>>diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
> >>>>>index 1d5d898..999c9fc 100644
> >>>>>--- a/gcc/config/aarch64/aarch64.c
> >>>>>+++ b/gcc/config/aarch64/aarch64.c
> >>>>>@@ -484,8 +484,7 @@ static const struct tune_params cortexa57_tunings =
> >>>>>    0,	/* max_case_values.  */
> >>>>>    0,	/* cache_line_size.  */
> >>>>>    tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
> >>>>>-  (AARCH64_EXTRA_TUNE_RENAME_FMA_REGS
> >>>>>-   | AARCH64_EXTRA_TUNE_RECIP_SQRT)	/* tune_flags.  */
> >>>>>+  (AARCH64_EXTRA_TUNE_RENAME_FMA_REGS)	/* tune_flags.  */
> >>>>>  };
> >>>>>  static const struct tune_params cortexa72_tunings =
> >
> 
> James,
> 
> There seem to be SPEC CPU2000fp validation issues on A57 when this
> flag is present too.  Though I evaluated the algorithm with a huge
> random set of values, always delivering accuracy around 1ulp, which
> should be enough for CPU2000fp (wit x86-64), I expected the
> benchmarks to pass.
> 
> My suspicion is that the Newton series on AArch64 is probably good
> only for SP.  Then, DP might require an extra round, probably
> exacerbating the performance penalty.
> 
> I'd like to try to split this tuning option into one for SP and
> another for DP.  Thoughts?

I haven't seen validation issues with the default expansion, but with
-mlow-precision-recip-sqrt I do see failures. I think this is to be
expected. I don't support splitting the low-precision flag to
"-mlow-precision-float-recip-sqrt" and "-mlow-precision-double-recip-sqrt",
I think that is pushing a particular set of Spec tuning flags over any
meaningful use case.

I could imagine a case for splitting the internal tuning flag to give
AARCH64_EXTRA_TUNE_SF_RECIP_SQRT and AARCH64_EXTRA_TUNE_DF_RECIP_SQRT, but
I'm not sure I understand the benefits of this? Certainly, I think your goals
for performance (turn on for 64-bit divide/sqrt) would contradict your goals
for accuracy (turn off for 64-bit divide/sqrt).

I'm happy with these flags as they are, but I might be missing a subtelty in
your argument?

Thanks,
James
Evandro Menezes Feb. 16, 2016, 8:46 p.m. UTC | #9
On 02/16/16 04:28, James Greenhalgh wrote:
> On Mon, Feb 15, 2016 at 11:24:53AM -0600, Evandro Menezes wrote:
>> James,
>>
>> There seem to be SPEC CPU2000fp validation issues on A57 when this
>> flag is present too.  Though I evaluated the algorithm with a huge
>> random set of values, always delivering accuracy around 1ulp, which
>> should be enough for CPU2000fp (wit x86-64), I expected the
>> benchmarks to pass.
>>
>> My suspicion is that the Newton series on AArch64 is probably good
>> only for SP.  Then, DP might require an extra round, probably
>> exacerbating the performance penalty.
>>
>> I'd like to try to split this tuning option into one for SP and
>> another for DP.  Thoughts?
> I haven't seen validation issues with the default expansion, but with
> -mlow-precision-recip-sqrt I do see failures. I think this is to be
> expected. I don't support splitting the low-precision flag to
> "-mlow-precision-float-recip-sqrt" and "-mlow-precision-double-recip-sqrt",
> I think that is pushing a particular set of Spec tuning flags over any
> meaningful use case.
>
> I could imagine a case for splitting the internal tuning flag to give
> AARCH64_EXTRA_TUNE_SF_RECIP_SQRT and AARCH64_EXTRA_TUNE_DF_RECIP_SQRT, but
> I'm not sure I understand the benefits of this? Certainly, I think your goals
> for performance (turn on for 64-bit divide/sqrt) would contradict your goals
> for accuracy (turn off for 64-bit divide/sqrt).
>
> I'm happy with these flags as they are, but I might be missing a subtelty in
> your argument?

James,

I'm still in sorting out the data, but, indeed, I see no validation 
issues with the approximate reciprocal square root in CPU2000.

However, working on a patch to use the Newton series for square root 
based on the approximate reciprocal square root (d^1/2 = d * d^-1/2), I 
stumbled at validation errors.  I'll take the discussion to that thread.

Stand by...
diff mbox

Patch

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 1d5d898..999c9fc 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -484,8 +484,7 @@  static const struct tune_params cortexa57_tunings =
   0,	/* max_case_values.  */
   0,	/* cache_line_size.  */
   tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
-  (AARCH64_EXTRA_TUNE_RENAME_FMA_REGS
-   | AARCH64_EXTRA_TUNE_RECIP_SQRT)	/* tune_flags.  */
+  (AARCH64_EXTRA_TUNE_RENAME_FMA_REGS)	/* tune_flags.  */
 };
 
 static const struct tune_params cortexa72_tunings =