Message ID | 1452513883-25826-1-git-send-email-james.greenhalgh@arm.com |
---|---|
State | New |
Headers | show |
James, ok from our side—good to see that this also benefits the A57. Best, Philipp. > On 11 Jan 2016, at 13:04, James Greenhalgh <james.greenhalgh@arm.com> wrote: > > > Hi, > > I've seen a couple of large performance issues caused by expanding > the high-precision reciprocal square root for Cortex-A57, so I'd like > to turn it off by default. > > This is good for art (~2%) from Spec2000, bad (~3.5%) for fma3d from > Spec2000, good (~5.5%) for gromcas from Spec2006, and very good (>10%) for > some private microbenchmark kernels which stress the divide/sqrt/multiply > units. It therefore seems to me to be the correct choice to make across > a number of workloads. > > Bootstrapped and tested on aarch64-none-linux-gnu with no issues. > > OK? > > Thanks, > James > > --- > 2015-12-11 James Greenhalgh <james.greenhalgh@arm.com> > > * config/aarch64/aarch64.c (cortexa57_tunings): Remove > AARCH64_EXTRA_TUNE_RECIP_SQRT. > > <0001-AArch64-Remove-AARCH64_EXTRA_TUNE_RECIP_SQRT-from-Co.patch>
On Mon, Jan 11, 2016 at 12:04:43PM +0000, James Greenhalgh wrote: > > Hi, > > I've seen a couple of large performance issues caused by expanding > the high-precision reciprocal square root for Cortex-A57, so I'd like > to turn it off by default. > > This is good for art (~2%) from Spec2000, bad (~3.5%) for fma3d from > Spec2000, good (~5.5%) for gromcas from Spec2006, and very good (>10%) for > some private microbenchmark kernels which stress the divide/sqrt/multiply > units. It therefore seems to me to be the correct choice to make across > a number of workloads. > > Bootstrapped and tested on aarch64-none-linux-gnu with no issues. > > OK? *Ping* Thanks, James > --- > 2015-12-11 James Greenhalgh <james.greenhalgh@arm.com> > > * config/aarch64/aarch64.c (cortexa57_tunings): Remove > AARCH64_EXTRA_TUNE_RECIP_SQRT. > > diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c > index 1d5d898..999c9fc 100644 > --- a/gcc/config/aarch64/aarch64.c > +++ b/gcc/config/aarch64/aarch64.c > @@ -484,8 +484,7 @@ static const struct tune_params cortexa57_tunings = > 0, /* max_case_values. */ > 0, /* cache_line_size. */ > tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ > - (AARCH64_EXTRA_TUNE_RENAME_FMA_REGS > - | AARCH64_EXTRA_TUNE_RECIP_SQRT) /* tune_flags. */ > + (AARCH64_EXTRA_TUNE_RENAME_FMA_REGS) /* tune_flags. */ > }; > > static const struct tune_params cortexa72_tunings =
On Mon, Jan 25, 2016 at 11:20:46AM +0000, James Greenhalgh wrote: > On Mon, Jan 11, 2016 at 12:04:43PM +0000, James Greenhalgh wrote: > > > > Hi, > > > > I've seen a couple of large performance issues caused by expanding > > the high-precision reciprocal square root for Cortex-A57, so I'd like > > to turn it off by default. > > > > This is good for art (~2%) from Spec2000, bad (~3.5%) for fma3d from > > Spec2000, good (~5.5%) for gromcas from Spec2006, and very good (>10%) for > > some private microbenchmark kernels which stress the divide/sqrt/multiply > > units. It therefore seems to me to be the correct choice to make across > > a number of workloads. > > > > Bootstrapped and tested on aarch64-none-linux-gnu with no issues. > > > > OK? > > *Ping* *pingx2* Thanks, James > > --- > > 2015-12-11 James Greenhalgh <james.greenhalgh@arm.com> > > > > * config/aarch64/aarch64.c (cortexa57_tunings): Remove > > AARCH64_EXTRA_TUNE_RECIP_SQRT. > > > > > diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c > > index 1d5d898..999c9fc 100644 > > --- a/gcc/config/aarch64/aarch64.c > > +++ b/gcc/config/aarch64/aarch64.c > > @@ -484,8 +484,7 @@ static const struct tune_params cortexa57_tunings = > > 0, /* max_case_values. */ > > 0, /* cache_line_size. */ > > tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ > > - (AARCH64_EXTRA_TUNE_RENAME_FMA_REGS > > - | AARCH64_EXTRA_TUNE_RECIP_SQRT) /* tune_flags. */ > > + (AARCH64_EXTRA_TUNE_RENAME_FMA_REGS) /* tune_flags. */ > > }; > > > > static const struct tune_params cortexa72_tunings = >
On Mon, Feb 01, 2016 at 02:00:01PM +0000, James Greenhalgh wrote: > On Mon, Jan 25, 2016 at 11:20:46AM +0000, James Greenhalgh wrote: > > On Mon, Jan 11, 2016 at 12:04:43PM +0000, James Greenhalgh wrote: > > > > > > Hi, > > > > > > I've seen a couple of large performance issues caused by expanding > > > the high-precision reciprocal square root for Cortex-A57, so I'd like > > > to turn it off by default. > > > > > > This is good for art (~2%) from Spec2000, bad (~3.5%) for fma3d from > > > Spec2000, good (~5.5%) for gromcas from Spec2006, and very good (>10%) for > > > some private microbenchmark kernels which stress the divide/sqrt/multiply > > > units. It therefore seems to me to be the correct choice to make across > > > a number of workloads. > > > > > > Bootstrapped and tested on aarch64-none-linux-gnu with no issues. > > > > > > OK? > > > > *Ping* > > *pingx2* *ping^3* Thanks, James > > > --- > > > 2015-12-11 James Greenhalgh <james.greenhalgh@arm.com> > > > > > > * config/aarch64/aarch64.c (cortexa57_tunings): Remove > > > AARCH64_EXTRA_TUNE_RECIP_SQRT. > > > > > > > > diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c > > > index 1d5d898..999c9fc 100644 > > > --- a/gcc/config/aarch64/aarch64.c > > > +++ b/gcc/config/aarch64/aarch64.c > > > @@ -484,8 +484,7 @@ static const struct tune_params cortexa57_tunings = > > > 0, /* max_case_values. */ > > > 0, /* cache_line_size. */ > > > tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ > > > - (AARCH64_EXTRA_TUNE_RENAME_FMA_REGS > > > - | AARCH64_EXTRA_TUNE_RECIP_SQRT) /* tune_flags. */ > > > + (AARCH64_EXTRA_TUNE_RENAME_FMA_REGS) /* tune_flags. */ > > > }; > > > > > > static const struct tune_params cortexa72_tunings = > > >
On Mon, Feb 08, 2016 at 10:57:10AM +0000, James Greenhalgh wrote: > On Mon, Feb 01, 2016 at 02:00:01PM +0000, James Greenhalgh wrote: > > On Mon, Jan 25, 2016 at 11:20:46AM +0000, James Greenhalgh wrote: > > > On Mon, Jan 11, 2016 at 12:04:43PM +0000, James Greenhalgh wrote: > > > > > > > > Hi, > > > > > > > > I've seen a couple of large performance issues caused by expanding > > > > the high-precision reciprocal square root for Cortex-A57, so I'd like > > > > to turn it off by default. > > > > > > > > This is good for art (~2%) from Spec2000, bad (~3.5%) for fma3d from > > > > Spec2000, good (~5.5%) for gromcas from Spec2006, and very good (>10%) for > > > > some private microbenchmark kernels which stress the divide/sqrt/multiply > > > > units. It therefore seems to me to be the correct choice to make across > > > > a number of workloads. > > > > > > > > Bootstrapped and tested on aarch64-none-linux-gnu with no issues. > > > > > > > > OK? > > > > > > *Ping* > > > > *pingx2* > > *ping^3* *ping^4* Thanks, James > > > > --- > > > > 2015-12-11 James Greenhalgh <james.greenhalgh@arm.com> > > > > > > > > * config/aarch64/aarch64.c (cortexa57_tunings): Remove > > > > AARCH64_EXTRA_TUNE_RECIP_SQRT. > > > > > > > > > > > diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c > > > > index 1d5d898..999c9fc 100644 > > > > --- a/gcc/config/aarch64/aarch64.c > > > > +++ b/gcc/config/aarch64/aarch64.c > > > > @@ -484,8 +484,7 @@ static const struct tune_params cortexa57_tunings = > > > > 0, /* max_case_values. */ > > > > 0, /* cache_line_size. */ > > > > tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ > > > > - (AARCH64_EXTRA_TUNE_RENAME_FMA_REGS > > > > - | AARCH64_EXTRA_TUNE_RECIP_SQRT) /* tune_flags. */ > > > > + (AARCH64_EXTRA_TUNE_RENAME_FMA_REGS) /* tune_flags. */ > > > > }; > > > > > > > > static const struct tune_params cortexa72_tunings = > > > > > >
On 02/15/16 04:50, James Greenhalgh wrote: > On Mon, Feb 08, 2016 at 10:57:10AM +0000, James Greenhalgh wrote: >> On Mon, Feb 01, 2016 at 02:00:01PM +0000, James Greenhalgh wrote: >>> On Mon, Jan 25, 2016 at 11:20:46AM +0000, James Greenhalgh wrote: >>>> On Mon, Jan 11, 2016 at 12:04:43PM +0000, James Greenhalgh wrote: >>>>> Hi, >>>>> >>>>> I've seen a couple of large performance issues caused by expanding >>>>> the high-precision reciprocal square root for Cortex-A57, so I'd like >>>>> to turn it off by default. >>>>> >>>>> This is good for art (~2%) from Spec2000, bad (~3.5%) for fma3d from >>>>> Spec2000, good (~5.5%) for gromcas from Spec2006, and very good (>10%) for >>>>> some private microbenchmark kernels which stress the divide/sqrt/multiply >>>>> units. It therefore seems to me to be the correct choice to make across >>>>> a number of workloads. >>>>> >>>>> Bootstrapped and tested on aarch64-none-linux-gnu with no issues. >>>>> >>>>> OK? >>>> *Ping* >>> *pingx2* >> *ping^3* > *ping^4* > > Thanks, > James > >>>>> --- >>>>> 2015-12-11 James Greenhalgh <james.greenhalgh@arm.com> >>>>> >>>>> * config/aarch64/aarch64.c (cortexa57_tunings): Remove >>>>> AARCH64_EXTRA_TUNE_RECIP_SQRT. >>>>> >>>>> diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c >>>>> index 1d5d898..999c9fc 100644 >>>>> --- a/gcc/config/aarch64/aarch64.c >>>>> +++ b/gcc/config/aarch64/aarch64.c >>>>> @@ -484,8 +484,7 @@ static const struct tune_params cortexa57_tunings = >>>>> 0, /* max_case_values. */ >>>>> 0, /* cache_line_size. */ >>>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ >>>>> - (AARCH64_EXTRA_TUNE_RENAME_FMA_REGS >>>>> - | AARCH64_EXTRA_TUNE_RECIP_SQRT) /* tune_flags. */ >>>>> + (AARCH64_EXTRA_TUNE_RENAME_FMA_REGS) /* tune_flags. */ >>>>> }; >>>>> >>>>> static const struct tune_params cortexa72_tunings = > James, There seem to be SPEC CPU2000fp validation issues on A57 when this flag is present too. Though I evaluated the algorithm with a huge random set of values, always delivering accuracy around 1ulp, which should be enough for CPU2000fp (wit x86-64), I expected the benchmarks to pass. My suspicion is that the Newton series on AArch64 is probably good only for SP. Then, DP might require an extra round, probably exacerbating the performance penalty. I'd like to try to split this tuning option into one for SP and another for DP. Thoughts? Thank you,
On 11 January 2016 at 12:04, James Greenhalgh <james.greenhalgh@arm.com> wrote: > 2015-12-11 James Greenhalgh <james.greenhalgh@arm.com> > > * config/aarch64/aarch64.c (cortexa57_tunings): Remove > AARCH64_EXTRA_TUNE_RECIP_SQRT. > OK /Marcus
On Mon, Feb 15, 2016 at 11:24:53AM -0600, Evandro Menezes wrote: > On 02/15/16 04:50, James Greenhalgh wrote: > >On Mon, Feb 08, 2016 at 10:57:10AM +0000, James Greenhalgh wrote: > >>On Mon, Feb 01, 2016 at 02:00:01PM +0000, James Greenhalgh wrote: > >>>On Mon, Jan 25, 2016 at 11:20:46AM +0000, James Greenhalgh wrote: > >>>>On Mon, Jan 11, 2016 at 12:04:43PM +0000, James Greenhalgh wrote: > >>>>>Hi, > >>>>> > >>>>>I've seen a couple of large performance issues caused by expanding > >>>>>the high-precision reciprocal square root for Cortex-A57, so I'd like > >>>>>to turn it off by default. > >>>>> > >>>>>This is good for art (~2%) from Spec2000, bad (~3.5%) for fma3d from > >>>>>Spec2000, good (~5.5%) for gromcas from Spec2006, and very good (>10%) for > >>>>>some private microbenchmark kernels which stress the divide/sqrt/multiply > >>>>>units. It therefore seems to me to be the correct choice to make across > >>>>>a number of workloads. > >>>>> > >>>>>Bootstrapped and tested on aarch64-none-linux-gnu with no issues. > >>>>> > >>>>>OK? > >>>>*Ping* > >>>*pingx2* > >>*ping^3* > >*ping^4* > > > >Thanks, > >James > > > >>>>>--- > >>>>>2015-12-11 James Greenhalgh <james.greenhalgh@arm.com> > >>>>> > >>>>> * config/aarch64/aarch64.c (cortexa57_tunings): Remove > >>>>> AARCH64_EXTRA_TUNE_RECIP_SQRT. > >>>>> > >>>>>diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c > >>>>>index 1d5d898..999c9fc 100644 > >>>>>--- a/gcc/config/aarch64/aarch64.c > >>>>>+++ b/gcc/config/aarch64/aarch64.c > >>>>>@@ -484,8 +484,7 @@ static const struct tune_params cortexa57_tunings = > >>>>> 0, /* max_case_values. */ > >>>>> 0, /* cache_line_size. */ > >>>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ > >>>>>- (AARCH64_EXTRA_TUNE_RENAME_FMA_REGS > >>>>>- | AARCH64_EXTRA_TUNE_RECIP_SQRT) /* tune_flags. */ > >>>>>+ (AARCH64_EXTRA_TUNE_RENAME_FMA_REGS) /* tune_flags. */ > >>>>> }; > >>>>> static const struct tune_params cortexa72_tunings = > > > > James, > > There seem to be SPEC CPU2000fp validation issues on A57 when this > flag is present too. Though I evaluated the algorithm with a huge > random set of values, always delivering accuracy around 1ulp, which > should be enough for CPU2000fp (wit x86-64), I expected the > benchmarks to pass. > > My suspicion is that the Newton series on AArch64 is probably good > only for SP. Then, DP might require an extra round, probably > exacerbating the performance penalty. > > I'd like to try to split this tuning option into one for SP and > another for DP. Thoughts? I haven't seen validation issues with the default expansion, but with -mlow-precision-recip-sqrt I do see failures. I think this is to be expected. I don't support splitting the low-precision flag to "-mlow-precision-float-recip-sqrt" and "-mlow-precision-double-recip-sqrt", I think that is pushing a particular set of Spec tuning flags over any meaningful use case. I could imagine a case for splitting the internal tuning flag to give AARCH64_EXTRA_TUNE_SF_RECIP_SQRT and AARCH64_EXTRA_TUNE_DF_RECIP_SQRT, but I'm not sure I understand the benefits of this? Certainly, I think your goals for performance (turn on for 64-bit divide/sqrt) would contradict your goals for accuracy (turn off for 64-bit divide/sqrt). I'm happy with these flags as they are, but I might be missing a subtelty in your argument? Thanks, James
On 02/16/16 04:28, James Greenhalgh wrote: > On Mon, Feb 15, 2016 at 11:24:53AM -0600, Evandro Menezes wrote: >> James, >> >> There seem to be SPEC CPU2000fp validation issues on A57 when this >> flag is present too. Though I evaluated the algorithm with a huge >> random set of values, always delivering accuracy around 1ulp, which >> should be enough for CPU2000fp (wit x86-64), I expected the >> benchmarks to pass. >> >> My suspicion is that the Newton series on AArch64 is probably good >> only for SP. Then, DP might require an extra round, probably >> exacerbating the performance penalty. >> >> I'd like to try to split this tuning option into one for SP and >> another for DP. Thoughts? > I haven't seen validation issues with the default expansion, but with > -mlow-precision-recip-sqrt I do see failures. I think this is to be > expected. I don't support splitting the low-precision flag to > "-mlow-precision-float-recip-sqrt" and "-mlow-precision-double-recip-sqrt", > I think that is pushing a particular set of Spec tuning flags over any > meaningful use case. > > I could imagine a case for splitting the internal tuning flag to give > AARCH64_EXTRA_TUNE_SF_RECIP_SQRT and AARCH64_EXTRA_TUNE_DF_RECIP_SQRT, but > I'm not sure I understand the benefits of this? Certainly, I think your goals > for performance (turn on for 64-bit divide/sqrt) would contradict your goals > for accuracy (turn off for 64-bit divide/sqrt). > > I'm happy with these flags as they are, but I might be missing a subtelty in > your argument? James, I'm still in sorting out the data, but, indeed, I see no validation issues with the approximate reciprocal square root in CPU2000. However, working on a patch to use the Newton series for square root based on the approximate reciprocal square root (d^1/2 = d * d^-1/2), I stumbled at validation errors. I'll take the discussion to that thread. Stand by...
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c index 1d5d898..999c9fc 100644 --- a/gcc/config/aarch64/aarch64.c +++ b/gcc/config/aarch64/aarch64.c @@ -484,8 +484,7 @@ static const struct tune_params cortexa57_tunings = 0, /* max_case_values. */ 0, /* cache_line_size. */ tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ - (AARCH64_EXTRA_TUNE_RENAME_FMA_REGS - | AARCH64_EXTRA_TUNE_RECIP_SQRT) /* tune_flags. */ + (AARCH64_EXTRA_TUNE_RENAME_FMA_REGS) /* tune_flags. */ }; static const struct tune_params cortexa72_tunings =