
RISC-V: Enable overlap-by-pieces in case of fast unaligned access

Message ID 20210722125258.3640995-1-cmuellner@gcc.gnu.org
State New
Series RISC-V: Enable overlap-by-pieces in case of fast unaligned access

Commit Message

Christoph Müllner July 22, 2021, 12:52 p.m. UTC
This patch enables the overlap-by-pieces feature of the by-pieces
infrastructure for inlining builtins in case the target has set
riscv_slow_unaligned_access_p to false.

To demonstrate the effect for targets with fast unaligned access,
the following code sequences are generated for a 15-byte memset-zero.

Without overlap_op_by_pieces we get:
  8e:   00053023                sd      zero,0(a0)
  92:   00052423                sw      zero,8(a0)
  96:   00051623                sh      zero,12(a0)
  9a:   00050723                sb      zero,14(a0)

With overlap_op_by_pieces we get:
  7e:   00053023                sd      zero,0(a0)
  82:   000533a3                sd      zero,7(a0)
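
The same two sequences as a rough C sketch (illustrative only, not part
of the patch; the fixed-size memcpy calls stand in for the single store
instructions GCC emits for them):

#include <stdint.h>
#include <string.h>

/* Four non-overlapping stores: 8 + 4 + 2 + 1 = 15 bytes.  */
static void
memset15_separate (char *dst)
{
  uint64_t z8 = 0; uint32_t z4 = 0; uint16_t z2 = 0; uint8_t z1 = 0;
  memcpy (dst + 0, &z8, 8);   /* sd zero,0(a0)  */
  memcpy (dst + 8, &z4, 4);   /* sw zero,8(a0)  */
  memcpy (dst + 12, &z2, 2);  /* sh zero,12(a0) */
  memcpy (dst + 14, &z1, 1);  /* sb zero,14(a0) */
}

/* Two overlapping stores: bytes 0..7 and 7..14 cover all 15 bytes.  */
static void
memset15_overlapping (char *dst)
{
  uint64_t z8 = 0;
  memcpy (dst + 0, &z8, 8);   /* sd zero,0(a0) */
  memcpy (dst + 7, &z8, 8);   /* sd zero,7(a0): overlaps byte 7 */
}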

gcc/ChangeLog:

	* config/riscv/riscv.c (riscv_overlap_op_by_pieces): New function.
	(TARGET_OVERLAP_OP_BY_PIECES_P): Connect to
	riscv_overlap_op_by_pieces.

Signed-off-by: Christoph Muellner <cmuellner@gcc.gnu.org>
---
 gcc/config/riscv/riscv.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

Comments

Kito Cheng July 22, 2021, 1:29 p.m. UTC | #1
Could you add a testcase? Otherwise LGTM.

Option: -O2 -mtune=thead-c906 -march=rv64gc -mabi=lp64
void foo(char *dst){
   __builtin_memset(dst, 0, 15);
}
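
Wrapped as a dejagnu test, that snippet might look like this (the dg
directives and the scan-assembler pattern are illustrative assumptions,
not the testcase that was posted later):

/* { dg-do compile } */
/* { dg-options "-O2 -mtune=thead-c906 -march=rv64gc -mabi=lp64" } */

void foo(char *dst)
{
   __builtin_memset(dst, 0, 15);
}

/* Expect exactly the two overlapping doubleword stores.  */
/* { dg-final { scan-assembler-times "sd\tzero" 2 } } */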

On Thu, Jul 22, 2021 at 8:53 PM Christoph Muellner via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
> This patch enables the overlap-by-pieces feature of the by-pieces
> infrastructure for inlining builtins in case the target has set
> riscv_slow_unaligned_access_p to false.
>
> To demonstrate the effect for targets with fast unaligned access,
> the following code sequences are generated for a 15-byte memset-zero.
>
> Without overlap_op_by_pieces we get:
>   8e:   00053023                sd      zero,0(a0)
>   92:   00052423                sw      zero,8(a0)
>   96:   00051623                sh      zero,12(a0)
>   9a:   00050723                sb      zero,14(a0)
>
> With overlap_op_by_pieces we get:
>   7e:   00053023                sd      zero,0(a0)
>   82:   000533a3                sd      zero,7(a0)
>
> gcc/ChangeLog:
>
>         * config/riscv/riscv.c (riscv_overlap_op_by_pieces): New function.
>         (TARGET_OVERLAP_OP_BY_PIECES_P): Connect to
>         riscv_overlap_op_by_pieces.
>
> Signed-off-by: Christoph Muellner <cmuellner@gcc.gnu.org>
> ---
>  gcc/config/riscv/riscv.c | 11 +++++++++++
>  1 file changed, 11 insertions(+)
>
> diff --git a/gcc/config/riscv/riscv.c b/gcc/config/riscv/riscv.c
> index 576960bb37c..98c76ba657a 100644
> --- a/gcc/config/riscv/riscv.c
> +++ b/gcc/config/riscv/riscv.c
> @@ -5201,6 +5201,14 @@ riscv_slow_unaligned_access (machine_mode, unsigned int)
>    return riscv_slow_unaligned_access_p;
>  }
>
> +/* Implement TARGET_OVERLAP_OP_BY_PIECES_P.  */
> +
> +static bool
> +riscv_overlap_op_by_pieces (void)
> +{
> +  return !riscv_slow_unaligned_access_p;
> +}
> +
>  /* Implement TARGET_CAN_CHANGE_MODE_CLASS.  */
>
>  static bool
> @@ -5525,6 +5533,9 @@ riscv_asan_shadow_offset (void)
>  #undef TARGET_SLOW_UNALIGNED_ACCESS
>  #define TARGET_SLOW_UNALIGNED_ACCESS riscv_slow_unaligned_access
>
> +#undef TARGET_OVERLAP_OP_BY_PIECES_P
> +#define TARGET_OVERLAP_OP_BY_PIECES_P riscv_overlap_op_by_pieces
> +
>  #undef TARGET_SECONDARY_MEMORY_NEEDED
>  #define TARGET_SECONDARY_MEMORY_NEEDED riscv_secondary_memory_needed
>
> --
> 2.31.1
>
Palmer Dabbelt July 22, 2021, 5:27 p.m. UTC | #2
On Thu, 22 Jul 2021 06:29:46 PDT (-0700), gcc-patches@gcc.gnu.org wrote:
> Could you add a testcase? Otherwise LGTM.
>
> Option: -O2 -mtune=thead-c906 -march=rv64gc -mabi=lp64
> void foo(char *dst){
>    __builtin_memset(dst, 0, 15);
> }

I'd like to see:

* Test results.  This is only on for one target right now, so relying on 
  it to just work on others isn't a good idea.
* Something to demonstrate this doesn't break -mstrict-align.
* Some sort of performance analysis.  Most machines that support 
  unaligned access do so with some performance degradation, it's unclear 
  that this conversion will actually manifest a performance improvement.  
  I don't have a C906 and don't know what workloads people care about 
  running on one, but I'm certainly not convinced this is a win -- 
  what's listed here looks to be the best case, and that's only saving 
  two instructions to generate a pretty pathological sequence 
  (misaligned access that conflicts with a prior store).

Jojo: do you have any description of the C906 pipeline?  Specifically in 
this case it'd be good to know how it handles unaligned accesses.

>
> On Thu, Jul 22, 2021 at 8:53 PM Christoph Muellner via Gcc-patches
> <gcc-patches@gcc.gnu.org> wrote:
>> [...]
Christoph Müllner July 22, 2021, 6:23 p.m. UTC | #3
On Thu, Jul 22, 2021 at 7:27 PM Palmer Dabbelt <palmer@dabbelt.com> wrote:
>
> On Thu, 22 Jul 2021 06:29:46 PDT (-0700), gcc-patches@gcc.gnu.org wrote:
> > Could you add a testcase? Otherwise LGTM.
> >
> > Option: -O2 -mtune=thead-c906 -march=rv64gc -mabi=lp64
> > void foo(char *dst){
> >    __builtin_memset(dst, 0, 15);
> > }
>
> I'd like to see:
>
> * Test results.  This is only on for one target right now, so relying on
>   it to just work on others isn't a good idea.

So you prefer the previous version of the patch, where each
tune struct can decide if that feature should be enabled or not?

> * Something to demonstrate this doesn't break -mstrict-align.

-mstrict-align sets riscv_slow_unaligned_access_p to true.
This will be returned by TARGET_SLOW_UNALIGNED_ACCESS.
So this cannot happen. I will add an additional test for that.
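
A sketch of such a test (the dg directives and scan pattern are
assumptions, not the test that was eventually added):

/* { dg-do compile } */
/* { dg-options "-O2 -mtune=thead-c906 -march=rv64gc -mabi=lp64 -mstrict-align" } */

void foo(char *dst)
{
   __builtin_memset(dst, 0, 15);
}

/* With -mstrict-align no overlapping (i.e. misaligned) store may be
   emitted, so the doubleword store at offset 7 must not appear.  */
/* { dg-final { scan-assembler-not "sd\tzero,7" } } */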

> * Some sort of performance analysis.  Most machines that support
>   unaligned access do so with some performance degredation, it's unclear
>   that this conversion will actually manifst a performance improvement.
>   I don't have a C906 and don't know what workloads people care about
>   running on one, but I'm certainly not convinced this is a win --
>   what's listed here looks to be the best case, and that's only saving
>   two instructions to generate a pretty pathological sequence
>   (misaligned access that conflicts with a prior store).
>
> Jojo: do you have any description of the C906 pipeline?  Specifically in
> this case it'd be good to know how it handles unaligned accesses.

The tune struct already includes the field slow_unaligned_access.
For the C906 this is set to false. So the answer is: the C906 handles unaligned
access reasonably well (assuming the contents of its tune struct are correct).

Note that the overlapping access only happens if unaligned accesses
are allowed.
If slow_unaligned_access is set, then you will get a sequence of
store-byte instructions.
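
For reference, a minimal sketch of the wiring described here (struct and
field names as used in this thread; the C906 tune entry's name and its
other fields are abridged):

/* In gcc/config/riscv/riscv.c; cost fields omitted.  */
struct riscv_tune_param
{
  /* ... instruction cost fields ... */
  bool slow_unaligned_access;
};

/* C906 tuning: unaligned accesses are not considered slow, so with
   this patch riscv_overlap_op_by_pieces () returns true.  */
static const struct riscv_tune_param thead_c906_tune_info = {
  /* ... */
  false,    /* slow_unaligned_access */
};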

>
> > [...]
Christoph Müllner July 22, 2021, 10:54 p.m. UTC | #4
I have added tests for memset and memcpy.
Additionally, I have added a test to ensure that -mstrict-align is
still working.
I've run the complete GCC test suite with and without the resulting
patch with no regressions
(rv32imac/ilp32/medlow, rv32imafdc/ilp32d/medlow,
rv64imac/lp64/medlow, rv64imafdc/lp64d/medlow).

I have not changed the patch itself, because I think it is reasonable
to assume that overlapping accesses
(i.e. fewer store instructions) will always be cheaper on targets
that don't have penalties for unaligned accesses.
If maintainers disagree, I can change accordingly.

The new patch can be found here:
https://gcc.gnu.org/pipermail/gcc-patches/2021-July/575864.html


On Thu, Jul 22, 2021 at 8:23 PM Christoph Müllner <cmuellner@gcc.gnu.org> wrote:
> [...]
Kito Cheng July 26, 2021, 9:48 a.m. UTC | #5
LGTM, but I would like to wait for Palmer's ack.

I am fine with the current scheme (just checking riscv_slow_unaligned_access_p).
In case we see performance issues, we can consider changing mtune's
slow_unaligned_access field to a number rather than a boolean value,
so that we can have a better cost model for that.
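
A sketch of that idea (the field name, the threshold, and the
tune_param accessor are hypothetical):

struct riscv_tune_param
{
  /* ... */
  int unaligned_access_cost;    /* 0 = as cheap as an aligned access */
};

static bool
riscv_overlap_op_by_pieces (void)
{
  /* Only emit overlapping stores when unaligned stores are cheap.  */
  return tune_param->unaligned_access_cost <= 1;
}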


On Fri, Jul 23, 2021 at 6:55 AM Christoph Müllner via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
> [...]
Andrew Waterman July 26, 2021, 10:05 a.m. UTC | #6
On Thu, Jul 22, 2021 at 10:27 AM Palmer Dabbelt <palmer@dabbelt.com> wrote:
>
> On Thu, 22 Jul 2021 06:29:46 PDT (-0700), gcc-patches@gcc.gnu.org wrote:
> > Could you add a testcase? Otherwise LGTM.
> >
> > Option: -O2 -mtune=thead-c906 -march=rv64gc -mabi=lp64
> > void foo(char *dst){
> >    __builtin_memset(dst, 0, 15);
> > }
>
> I'd like to see:
>
> * Test results.  This is only on for one target right now, so relying on
>   it to just work on others isn't a good idea.
> * Something to demonstrate this doesn't break -mstrict-align.
> * Some sort of performance analysis.  Most machines that support
>   unaligned access do so with some performance degradation,

Also, some machines that gracefully support misaligned accesses under
most circumstances nevertheless experience a perf degradation when the
load depends on two stores that overlap partially but not fully.  This
transformation will obviously trigger such behavior from time to time.
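
A minimal sketch of the kind of sequence that can defeat store-to-load
forwarding (memcpy stands in for single load/store instructions):

#include <stdint.h>
#include <string.h>

uint64_t
partial_overlap_load (char *dst)
{
  uint64_t z = 0, v;
  memcpy (dst + 0, &z, 8);  /* store bytes 0..7  */
  memcpy (dst + 7, &z, 8);  /* store bytes 7..14 */
  /* This load overlaps both stores, but neither store contains it
     completely, so forwarding may fail and the load may stall.  */
  memcpy (&v, dst + 4, 8);  /* load bytes 4..11 */
  return v;
}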

Note, I'm not objecting to this patch; I'm only responding to Palmer's
comment.  It makes sense to enable this kind of optimization for
-mtune=native etc., just not for standard software distributions.


> [...]
Palmer Dabbelt July 27, 2021, 1:47 a.m. UTC | #7
On Mon, 26 Jul 2021 03:05:21 PDT (-0700), Andrew Waterman wrote:
> On Thu, Jul 22, 2021 at 10:27 AM Palmer Dabbelt <palmer@dabbelt.com> wrote:
>>
>> On Thu, 22 Jul 2021 06:29:46 PDT (-0700), gcc-patches@gcc.gnu.org wrote:
>> > Could you add a testcase? Otherwise LGTM.
>> >
>> > Option: -O2 -mtune=thead-c906 -march=rv64gc -mabi=lp64
>> > void foo(char *dst){
>> >    __builtin_memset(dst, 0, 15);
>> > }
>>
>> I'd like to see:
>>
>> * Test results.  This is only on for one target right now, so relying on
>>   it to just work on others isn't a good idea.
>> * Something to demonstrate this doesn't break -mstrict-align.
>> * Some sort of performance analysis.  Most machines that support
>>   unaligned access do so with some performance degradation,
>
> Also, some machines that gracefully support misaligned accesses under
> most circumstances nevertheless experience a perf degradation when the
> load depends on two stores that overlap partially but not fully.  This
> transformation will obviously trigger such behavior from time to time.

Ya, I thought I wrote a response to this but I guess it's just in a 
buffer somewhere.  The code sequences this is generating are really the 
worst case for unaligned stores: one of them is always guaranteed to be 
misaligned, and it partially overlaps with a store one cycle away.

We're really only saving a handful of instructions at best here, so 
there's not much room for error when it comes to these sorts of things.  
Even if this difficult case is handled fast we're only talking about 
saving 2 cycles, which is pretty borderline as the single-issue in-order 
machines I've worked with that do support misaligned accesses tend to 
take at least a few cycles of performance hit on misaligned accesses.  
Even if misaligned accesses are single cycle, some back-of-the-envelope 
calculations say that a pipeline flush when crossing a cache line would 
definitely push this negative, and one per page crossing would be in the 
margins (depending on how we assume the original accesses are aligned).

This is way too subtle of a thing to analyze without at least some 
knowledge of how this pipeline works, whether that's from a benchmark or 
just a pipeline description.

> Note, I'm not objecting to this patch; I'm only responding to Palmer's
> comment.  It makes sense to enable this kind of optimization for
> -mtune=native etc., just not for standard software distributions.

IMO there are really two cases here: -mtune=c906 and -Os (-mtune=native 
is sort of a red herring, it's just syntax).  For -mtune=c906 I'm happy 
enabling this as long as it's actually correct and improves performance, 
but that'll need at least some hardware-oriented analysis -- whether 
it's a benchmark or just a confirmation that these sequences are 
actually expected to run fast.

-Os is a different case, though.  Last time this came up we decided that 
-Os shouldn't purposefully misalign accesses, even when that saves code 
size, as that's too likely to hit pathological performance cases.  I 
don't know if the weights have changed here: the C906 is currently by 
far the cheapest chip (which likely means it's going to be the most 
popular), but it's unclear as to whether it's even RISC-V compliant and 
I don't know of many people who've actually gotten one.  IMO it's too 
early to flip -Os over to this behavior, we at least need to know that 
this chip is going to be sufficiently RISC-V-ey that standard software 
will even run on it much less be popular.

>
>
>>   it's unclear
>>   that this conversion will actually manifest a performance improvement.
>>   I don't have a C906 and don't know what workloads people care about
>>   running on one, but I'm certainly not convinced this is a win --
>>   what's listed here looks to be the best case, and that's only saving
>>   two instructions to generate a pretty pathological sequence
>>   (misaligned access that conflicts with a prior store).

Ah, I guess there it was ;)

>> [...]
Christoph Müllner July 27, 2021, 9:32 a.m. UTC | #8
Ok, so if I understand correctly, Palmer and Andrew prefer
overlap_op_by_pieces to be controlled
by its own field in the riscv_tune_param struct and not by the field
slow_unaligned_access in this struct
(i.e. slow_unaligned_access==false is not enough to imply
overlap_op_by_pieces==true).
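
For reference, the gist of that earlier revision (a sketch only; the
field name is taken from this discussion, and the revision itself is
not reproduced here):

struct riscv_tune_param
{
  /* ... */
  bool slow_unaligned_access;
  bool overlap_op_by_pieces;    /* per-tune opt-in */
};

static bool
riscv_overlap_op_by_pieces (void)
{
  return tune_param->overlap_op_by_pieces;
}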

I don't have access to pipeline details that give proof that there are cases
where this patch causes a performance penalty.

So, I leave this here as a summary for someone who has enough information and
interest to move this forward:
* the original patch should be sufficient, but does not have tests:
  https://gcc.gnu.org/pipermail/gcc-patches/2021-July/575791.html
* the tests can be taken from this patch:
  https://gcc.gnu.org/pipermail/gcc-patches/2021-July/575864.html
  Note that there is a duplicated "sw" in builtins-overlap-6.c, which
should be an "sd".

Thanks for the feedback!

On Tue, Jul 27, 2021 at 3:48 AM Palmer Dabbelt <palmer@dabbelt.com> wrote:
> [...]
Palmer Dabbelt July 29, 2021, 6:53 p.m. UTC | #9
On Tue, 27 Jul 2021 02:32:12 PDT (-0700), cmuellner@gcc.gnu.org wrote:
> Ok, so if I understand correctly Palmer and Andrew prefer
> overlap_op_by_pieces to be controlled
> by its own field in the riscv_tune_param struct and not by the field
> slow_unaligned_access in this struct
> (i.e. slow_unaligned_access==false is not enough to imply
> overlap_op_by_pieces==true).

I guess, but I'm not really worried about this at that level of detail 
right now.  It's not like the tune structures form any sort of external 
interface we have to keep stable; we can do whatever we want with those 
fields, so I'd just aim for encoding the desired behavior as simply as 
possible rather than trying to build something extensible.

There are really two questions we need to answer: is this code actually 
faster for the C906, and is this what the average user wants under -Os.  

That first one is pretty easy: just running those simple code sequences 
under a sweep of page offsets should be sufficient to determine if this 
is always faster (in which case it's an easy yes), if it's always slower 
(an easy no), or if there are some slow cases like page/cache line 
crossing (in which case we'd need to think a bit).
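
A minimal sketch of such a sweep (iteration count and timing method are
arbitrary; the empty asm keeps the stores from being optimized away):

#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

static char buf[2 * 4096] __attribute__ ((aligned (4096)));

int
main (void)
{
  uint64_t z = 0;
  for (int off = 0; off < 4096; off++)
    {
      struct timespec t0, t1;
      clock_gettime (CLOCK_MONOTONIC, &t0);
      for (int i = 0; i < 1000000; i++)
        {
          /* The overlapping two-store sequence under test.  */
          memcpy (buf + off, &z, 8);
          memcpy (buf + off + 7, &z, 8);
          __asm__ volatile ("" ::: "memory");
        }
      clock_gettime (CLOCK_MONOTONIC, &t1);
      long ns = (t1.tv_sec - t0.tv_sec) * 1000000000L
                + (t1.tv_nsec - t0.tv_nsec);
      printf ("offset %4d: %ld ns\n", off, ns);
    }
  return 0;
}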

The second one is a bit trickier.  In the past we'd said this sort of 
"actively misalign accesses to generate smaller code" thing 
isn't suitable for -Os (as most machines still have very slow unaligned 
accesses) but is suitable for -Oz (don't remember if that ever ended up 
in GCC, though).  That still seems like a reasonable decision, but if it 
turns out that implementations with fast unaligned accesses become the 
norm then it'd probably be worth revisiting it.  Not sure exactly how to 
determine that tipping point, but I think we're a long way away from it 
right now.

IMO it's really just premature to try and design an encoding of the 
tuning parameters until we have an idea of what they are, as we'll just 
end up devolving down the path of trying to encode all possible hardware 
and that's generally a huge waste of time.  Since there's no ABI here we 
can refactor this however we want as new tunings show up.

> I don't have access to pipeline details that give proof that there are cases
> where this patch causes a performance penalty.
>
> So, I leave this here as a summary for someone who has enough information and
> interest to move this forward:
> * the original patch should be sufficient, but does not have tests:
>   https://gcc.gnu.org/pipermail/gcc-patches/2021-July/575791.html
> * the tests can be taken from this patch:
>   https://gcc.gnu.org/pipermail/gcc-patches/2021-July/575864.html
>   Note that there is a duplicated "sw" in builtins-overlap-6.c, which
> should be an "sd".
>
> Thanks for the feedback!

Cool.  Looks like the C906 is starting to show up in the real world, so 
we should be able to find someone who has access to one and cares enough 
to at least run some simple benchmarks of these code sequences.  IMO 
that's a pretty low interest bar, so I don't see any harm in waiting -- 
when the hardware is common then I'm sure someone will care enough to 
give this a shot, and until then it's not really impacting anyone either 
way.

The -Os thing is a bigger discussion, and while I'm happy to have it I 
don't really think we're even close to these being common enough yet.  I 
saw your memmove patch and think the same rationale might apply there, 
but I haven't looked closely and won't have time to for a bit as I've 
got to get around to the other projects.

> On Tue, Jul 27, 2021 at 3:48 AM Palmer Dabbelt <palmer@dabbelt.com> wrote:
>> [...]
Christoph Müllner July 29, 2021, 7:36 p.m. UTC | #10
On Thu, Jul 29, 2021 at 8:54 PM Palmer Dabbelt <palmer@dabbelt.com> wrote:
>
> On Tue, 27 Jul 2021 02:32:12 PDT (-0700), cmuellner@gcc.gnu.org wrote:
> > Ok, so if I understand correctly, Palmer and Andrew prefer
> > overlap_op_by_pieces to be controlled
> > by its own field in the riscv_tune_param struct and not by the field
> > slow_unaligned_access in this struct
> > (i.e. slow_unaligned_access==false is not enough to imply
> > overlap_op_by_pieces==true).
>
> I guess, but I'm not really worried about this at that level of detail
> right now.  It's not like the tune structures form any sort of external
> interface we have to keep stable; we can do whatever we want with those
> fields, so I'd just aim for encoding the desired behavior as simply as
> possible rather than trying to build something extensible.
>
> There are really two questions we need to answer: is this code actually
> faster for the C906, and is this what the average user wants under -Os.

I never mentioned -Os.
My main goal is code compiled for -O2, -O3 or even -Ofast.
And I want to execute code as fast as possible.

Loading hot data from the cache is faster with a single
load-word instruction than with four load-byte instructions.
Fewer instructions mean less pressure on the instruction cache.
Fewer instructions mean less work for the CPU pipeline.
Architectures that don't have a penalty for unaligned accesses
therefore observe a performance benefit.

What I understand from Andrew's email is that it is not that simple:
an implementation might have a penalty for overlapping accesses
that is high enough to make avoiding them worthwhile. I don't have the
details for the C906, so I can't say if that's the case.

> That first one is pretty easy: just running those simple code sequences
> under a sweep of page offsets should be sufficient to determine if this
> is always faster (in which case it's an easy yes), if it's always slower
> (an easy no), or if there are some slow cases like page/cache line
> crossing (in which case we'd need to think a bit).
>
> The second one is a bit trickier.  In the past we'd said this sort of
> "actively misalign accesses to generate smaller code" thing
> isn't suitable for -Os (as most machines still have very slow unaligned
> accesses) but is suitable for -Oz (don't remember if that ever ended up
> in GCC, though).  That still seems like a reasonable decision, but if it
> turns out that implementations with fast unaligned accesses become the
> norm then it'd probably be worth revisiting it.  Not sure exactly how to
> determine that tipping point, but I think we're a long way away from it
> right now.
>
> IMO it's really just premature to try and design an encoding of the
> tuning parameters until we have an idea of what they are, as we'll just
> end up devolving down the path of trying to encode all possible hardware
> and that's generally a huge waste of time.  Since there's no ABI here we
> can refactor this however we want as new tunings show up.

I guess you mean that there needs to be a clear benefit for a supported
machine in GCC: either an obvious one (see below), one shown by
measurement results, or one decided by the machine's maintainer
(especially if the decision is a trade-off).

>
> > [...]
>
> Cool.  Looks like the C906 is starting to show up in the real world, so
> we should be able to find someone who has access to one and cares enough
> to at least run some simple benchamrks of these code sequences.  IMO
> that's a pretty low interest bar, so I don't see any harm in waiting --
> when the hardware is common then I'm sure someone will care enough to
> give this a shot, and until then it's not really impacting anyone either
> way.
>
> The -Os thing is a bigger discussion, and while I'm happy to have it I
> don't really think we're even close to these being common enough yet.  I
> saw your memmove patch and think the same rationale might apply there,
> but I haven't looked closely and won't have time to for a bit as I've
> got to get around to the other projects.

The cpymemsi patch is also targeting -O2 or higher for fast code execution.
And it is one of the cases where there is an obvious performance benefit
for all machines that have slow_unaligned_access==false.

At the moment the cpymemsi expansion for RISC-V is implemented as if
there is no machine with slow_unaligned_access==false.
And in fact there is a machine already in GCC mainline with this property: C906.

Machines that can do fast unaligned accesses should not waste their
cycles on load/store pairs of bytes if they can use load/store pairs of words.
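
A sketch of the difference (illustrative only; not the cpymemsi
expansion itself):

#include <stdint.h>
#include <string.h>

/* One load/store pair per byte: n memory operations each way.  */
void
copy_bytes (char *d, const char *s, size_t n)
{
  for (size_t i = 0; i < n; i++)
    d[i] = s[i];
}

/* One load/store pair per eight bytes, plus a byte tail.  */
void
copy_words (char *d, const char *s, size_t n)
{
  size_t i = 0;
  for (; i + 8 <= n; i += 8)
    {
      uint64_t w;
      memcpy (&w, s + i, 8);  /* ld */
      memcpy (d + i, &w, 8);  /* sd */
    }
  for (; i < n; i++)
    d[i] = s[i];
}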

Christoph Müllner Aug. 5, 2021, 9:11 a.m. UTC | #11
Ping.

Christoph Müllner Aug. 13, 2021, 5:52 p.m. UTC | #12
Ping.

Kito Cheng Aug. 16, 2021, 10:02 a.m. UTC | #13
Hi Christoph:

Could you submit a v3 patch, which is v1 with an overlap_op_by_pieces
field, the testcase from v2, and a few more comments to describe the
field?

And add an -mtune=ultra-size option to make it testable without
changing other behavior?
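
For concreteness, a minimal sketch of the shape I have in mind (the
field and variable names are my assumptions, not the actual v3 patch):

/* gcc/config/riscv/riscv.c */
struct riscv_tune_param
{
  /* ... existing cost and tuning fields ...  */
  bool slow_unaligned_access;
  bool overlap_op_by_pieces;	/* New tuning knob.  */
};

/* Implement TARGET_OVERLAP_OP_BY_PIECES_P.  */

static bool
riscv_overlap_op_by_pieces (void)
{
  return tune_param->overlap_op_by_pieces;
}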

Hi Palmer:

Are you OK with that?
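
For reference, the testcase could be a dejagnu wrapper around the
15-byte memset example discussed earlier, e.g. (the scan patterns are
my assumptions based on the expected two-sd expansion, not the actual
v2 test):

/* { dg-do compile } */
/* { dg-options "-O2 -mtune=thead-c906 -march=rv64gc -mabi=lp64" } */

void foo (char *dst)
{
  __builtin_memset (dst, 0, 15);
}

/* { dg-final { scan-assembler-times "sd\tzero" 2 } } */
/* { dg-final { scan-assembler-not "sb\t" } } */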


Palmer Dabbelt Aug. 16, 2021, 4:15 p.m. UTC | #14
On Mon, 16 Aug 2021 03:02:42 PDT (-0700), Kito Cheng wrote:
> Hi Christoph:
>
> Could you submit a v3 patch, which is v1 with an overlap_op_by_pieces
> field, the testcase from v2, and a few more comments to describe the
> field?
>
> And add an -mtune=ultra-size option to make it testable without
> changing other behavior?
>
> Hi Palmer:
>
> Are you OK with that?

I'm still not convinced on the performance: as Andrew and I pointed
out, this is a difficult case for pipelines of this flavor to handle.
Nobody here knows this pipeline deeply enough to say anything
definitive, though, so this is really just a guess.

As I'm not convinced this is an obvious performance win, I'm not going
to merge it without a benchmark.  If you're convinced and want to merge
it, that's fine; I don't really care about the performance of the C906,
and if someone complains we can always just revert it later.

> On Sat, Aug 14, 2021 at 1:54 AM Christoph Müllner via Gcc-patches
> <gcc-patches@gcc.gnu.org> wrote:
>>
>> Ping.
>>
>> On Thu, Aug 5, 2021 at 11:11 AM Christoph Müllner <cmuellner@gcc.gnu.org> wrote:
>> >
>> > Ping.
>> >
>> > On Thu, Jul 29, 2021 at 9:36 PM Christoph Müllner <cmuellner@gcc.gnu.org> wrote:
>> > >
>> > > On Thu, Jul 29, 2021 at 8:54 PM Palmer Dabbelt <palmer@dabbelt.com> wrote:
>> > > >
>> > > > On Tue, 27 Jul 2021 02:32:12 PDT (-0700), cmuellner@gcc.gnu.org wrote:
>> > > > > Ok, so if I understand correctly Palmer and Andrew prefer
>> > > > > overlap_op_by_pieces to be controlled
>> > > > > by its own field in the riscv_tune_param struct and not by the field
>> > > > > slow_unaligned_access in this struct
>> > > > > (i.e. slow_unaligned_access==false is not enough to imply
>> > > > > overlap_op_by_pieces==true).
>> > > >
>> > > > I guess, but I'm not really worried about this at that level of detail
>> > > > right now.  It's not like the tune structures form any sort of external
>> > > > interface we have to keep stable, we can do whatever we want with those
>> > > > fields so I'd just aim for encoding the desired behavior as simply as
>> > > > possible rather than trying to build something extensible.
>> > > >
>> > > > There are really two questions we need to answer: is this code actually
>> > > > faster for the C906, and is this what the average users wants under -Os.
>> > >
>> > > I never mentioned -Os.
>> > > My main goal is code compiled for -O2, -O3 or even -Ofast.
>> > > And I want to execute code as fast as possible.
>> > >
>> > > Loading hot data from cache is faster when being done by a single
>> > > load-word instruction than 4 load-byte instructions.
>> > > Less instructions implies less pressure for the instruction cache.
>> > > Less instructions implies less work for a CPU pipeline.
>> > > Architectures, which don't have a penalty for unaligned accesses
>> > > therefore observe a performance benefit.
>> > >
>> > > What I understand from Andrew's email is that it is not that simple
>> > > and implementation might have a penalty for overlapping accesses
>> > > that is high enough to avoid them. I don't have the details for C906,
>> > > so I can't say if that's the case.
>> > >
>> > > > That first one is pretty easy: just running those simple code sequences
>> > > > under a sweep of page offsets should be sufficient to determine if this
>> > > > is always faster (in which case it's an easy yes), if it's always slower
>> > > > (an easy no), or if there's some slow cases like page/cache line
>> > > > crossing (in which case we'd need to think a bit).
>> > > >
>> > > > The second one is a bit tricker.  In the past we'd said these sort of
>> > > > "actively misalign accesses to generate smaller code" sort of thing
>> > > > isn't suitable for -Os (as most machines still have very slow unaligned
>> > > > accesses) but is suitable for -Oz (don't remember if that ever ended up
>> > > > in GCC, though).  That still seems like a reasonable decision, but if it
>> > > > turns out that implementations with fast unaligned accesses become the
>> > > > norm then it'd probably be worth revisiting it.  Not sure exactly how to
>> > > > determine that tipping point, but I think we're a long way away from it
>> > > > right now.
>> > > >
>> > > > IMO it's really just premature to try and design an encoding of the
>> > > > tuning paramaters until we have an idea of what they are, as we'll just
>> > > > end up devolving down the path of trying to encode all possible hardware
>> > > > and that's generally a huge waste of time.  Since there's no ABI here we
>> > > > can refactor this however we want as new tunings show up.
>> > >
>> > > I guess you mean that there needs to be a clear benefit for a supported
>> > > machine in GCC. Either obviously (see below), by measurement results,
>> > > or by decision
>> > > of the machine's maintainer (especially if the decision is a trade-off).
>> > >
>> > > >
>> > > > > I don't have access to pipeline details that give proof that there are cases
>> > > > > where this patch causes a performance penalty.
>> > > > >
>> > > > > So, I leave this here as a summary for someone who has enough information and
>> > > > > interest to move this forward:
>> > > > > * the original patch should be sufficient, but does not have tests:
>> > > > >   https://gcc.gnu.org/pipermail/gcc-patches/2021-July/575791.html
>> > > > > * the tests can be taken from this patch:
>> > > > >   https://gcc.gnu.org/pipermail/gcc-patches/2021-July/575864.html
>> > > > >   Note, that there is a duplicated "sw" in builtins-overlap-6.c, which
>> > > > > should be a "sd".
>> > > > >
>> > > > > Thanks for the feedback!
>> > > >
>> > > > Cool.  Looks like the C906 is starting to show up in the real world, so
>> > > > we should be able to find someone who has access to one and cares enough
>> > > > to at least run some simple benchamrks of these code sequences.  IMO
>> > > > that's a pretty low interest bar, so I don't see any harm in waiting --
>> > > > when the hardware is common then I'm sure someone will care enough to
>> > > > give this a shot, and until then it's not really impacting anyone either
>> > > > way.
>> > > >
>> > > > The -Os thing is a bigger discussion, and while I'm happy to have it I
>> > > > don't really think we're even close to these being common enough yet.  I
>> > > > saw your memmove patch and think the same rationale might apply there,
>> > > > but I haven't looked closely and won't have time to for a bit as I've
>> > > > got to get around to the other projects.
>> > >
>> > > The cpymemsi patch is also targeting -O2 or higher for fast code execution.
>> > > And it is one of the cases where there is an obvious performance benefit
>> > > for all machines that have slow_unaligned_access==false.
>> > >
>> > > At the moment the cpymemsi expansion for RISC-V is implemented as if
>> > > there is no machine with slow_unaligned_access==false.
>> > > And in fact there is a machine already in GCC mainline with this property: C906.
>> > >
>> > > Machines that can do fast unaligned accesses should not be wasting their
>> > > cycles with load-store-pairs of bytes, if they can do load-store pairs of words.
>> > >
>> > >
>> > >
>> > >
>> > >
>> > > >
>> > > > > On Tue, Jul 27, 2021 at 3:48 AM Palmer Dabbelt <palmer@dabbelt.com> wrote:
>> > > > >>
>> > > > >> On Mon, 26 Jul 2021 03:05:21 PDT (-0700), Andrew Waterman wrote:
>> > > > >> > On Thu, Jul 22, 2021 at 10:27 AM Palmer Dabbelt <palmer@dabbelt.com> wrote:
>> > > > >> >>
>> > > > >> >> On Thu, 22 Jul 2021 06:29:46 PDT (-0700), gcc-patches@gcc.gnu.org wrote:
>> > > > >> >> > Could you add a testcase? Otherwise LGTM.
>> > > > >> >> >
>> > > > >> >> > Option: -O2 -mtune=thead-c906 -march=rv64gc -mabi=lp64
>> > > > >> >> > void foo(char *dst){
>> > > > >> >> >    __builtin_memset(dst, 0, 15);
>> > > > >> >> > }
>> > > > >> >>
>> > > > >> >> I'd like to see:
>> > > > >> >>
>> > > > >> >> * Test results.  This is only on for one target right now, so relying on
>> > > > >> >>   it to just work on others isn't a good idea.
>> > > > >> >> * Something to demonstrate this doesn't break -mstrict-align.
>> > > > >> >> * Some sort of performance analysis.  Most machines that support
>> > > > >> >>   unaligned access do so with some performance degredation,
>> > > > >> >
>> > > > >> > Also, some machines that gracefully support misaligned accesses under
>> > > > >> > most circumstances nevertheless experience a perf degradation when the
>> > > > >> > load depends on two stores that overlap partially but not fully.  This
>> > > > >> > transformation will obviously trigger such behavior from time to time.
>> > > > >>
>> > > > >> Ya, I thought I wrote a response to this but I guess it's just in a
>> > > > >> buffer somewhere.  The code sequences this is generating are really the
>> > > > >> worst case for unaligned stores: one of them is always guaranteed to be
>> > > > >> misaligned, and it partially overlaps with a store one cycle away.
>> > > > >>
>> > > > >> We're really only saving a handful of instructions at best here, so
>> > > > >> there's not much room for error when it comes to these sorts of things.
>> > > > >> Even if this difficult case is handled fast we're only talking about
>> > > > >> saving 2 cycles, which is pretty borderline as the single-issue in-order
>> > > > >> machines I've worked with that do support misaligned accesses tend to
>> > > > >> take at least a few cycles of performance hit on misaligned accesses.
>> > > > >> Even if misaligned accesses are single cycle, some back-of-the-envelope
>> > > > >> calculations say that a pipeline flush when crossing a cache line would
>> > > > >> definitely push this negative and one per page crossing would be in the
>> > > > >> margins (depending on how we assume the original accesses are aligned).
>> > > > >>
>> > > > >> This is way too subtle of a thing to analyze without at least some
>> > > > >> knowledge of how this pipeline works, whether that's from a benchmark or
>> > > > >> just a pipeline description.
>> > > > >>
>> > > > >> > Note, I'm not objecting to this patch; I'm only responding to Palmer's
>> > > > >> > comment.  It makes sense to enable this kind of optimization for
>> > > > >> > -mtune=native etc., just not for standard software distributions.
>> > > > >>
>> > > > >> IMO there are really two cases here: -mtune=c906 and -Os (-mtune=native
>> > > > >> is sort of a red herring, it's just syntax).  For -mtune=c906 I'm happy
>> > > > >> enabling this as long as it's actually correct and improves performance,
>> > > > >> but that'll need at least some hardware-oriented analysis -- whether
>> > > > >> it's a benchmark or just a confirmation that these sequences are
>> > > > >> actually expected to run fast.
>> > > > >>
>> > > > >> -Os is a different case, though.  Last time this came up we decided that
>> > > > >> -Os shouldn't purposefully misalign accesses, even when that saves code
>> > > > >> size, as that's too likely to hit pathological performance cases.  I
>> > > > >> don't know if the weights have changed here: the C906 is currently by
>> > > > >> far the cheapest chip (which likely means it's going to be the most
>> > > > >> popular), but it's unclear as to whether it's even RISC-V compliant and
>> > > > >> I don't know of many people who've actually gotten one.  IMO it's too
>> > > > >> early to flip -Os over to this behavior, we at least need to know that
>> > > > >> this chip is going to be sufficiently RISC-V-ey that standard software
>> > > > >> will even run on it much less be popular.
>> > > > >>
>> > > > >> >
>> > > > >> >
>> > > > >> >>   it's unclear
>> > > > >> >>   that this conversion will actually manifest a performance improvement.
>> > > > >> >>   I don't have a C906 and don't know what workloads people care about
>> > > > >> >>   running on one, but I'm certainly not convinced this is a win --
>> > > > >> >>   what's listed here looks to be the best case, and that's only saving
>> > > > >> >>   two instructions to generate a pretty pathological sequence
>> > > > >> >>   (misaligned access that conflicts with a prior store).
>> > > > >>
>> > > > >> Ah, I guess there it was ;)
>> > > > >>
>> > > > >> >>
>> > > > >> >> Jojo: do you have any description of the C906 pipeline?  Specifically in
>> > > > >> >> this case it'd be good to know how it handles unaligned accesses.
>> > > > >> >>
Kito Cheng Aug. 16, 2021, 4:29 p.m. UTC | #15
> > Could you submit a v3 patch, which is v1 with the overlap_op_by_pieces
> > field, the testcase from v2, and a few more comments describing the field?
> >
> > And add an -mtune=ultra-size to make it testable without changing
> > other behavior?
> >
> > Hi Palmer:
> >
> > Are you OK with that?
>
> I'm still not convinced on the performance: like Andrew and I pointed
> out, this is a difficult case for pipelines of this flavor to handle.
> Nobody here knows anything about this pipeline deeply enough to say
> anything definitive, though, so this is really just a guess.

So adding an extra field to indicate that should resolve it?
I believe people should only set overlap_op_by_pieces
to true if they are sure it has benefits.
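
Roughly something like this on top of v1 (untested sketch; I'm assuming
the tuning pointer is still called tune_param, as in current riscv.c):

  /* In struct riscv_tune_param, next to the existing flag: */
  struct riscv_tune_param
  {
    /* ... existing cost fields ... */
    bool slow_unaligned_access;
    bool overlap_op_by_pieces;  /* New: allow overlapping operations in
                                   the by-pieces infrastructure.  */
  };

  /* The hook then reads the per-tuning field instead of deriving the
     answer from slow_unaligned_access: */
  static bool
  riscv_overlap_op_by_pieces (void)
  {
    return tune_param->overlap_op_by_pieces;
  }

That way each tuning struct decides for itself, and the C906 entry can
stay conservative until someone benchmarks it.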

> As I'm not convinced this is an obvious performance win I'm not going to
> merge it without a benchmark.  If you're convinced and want to merge it
> that's fine, I don't really care about the performance of the C906 and
> if someone complains we can always just revert it later.

I suppose Christoph has tried this with their internal processor and it
benefits performance, but that processor can't be open-sourced yet, so the
v2 patch set uses the C906 to demo and test it, since that is the only
processor with slow_unaligned_access == false.

I agree on the C906 part, we never know whether it's a benefit or not, so
I propose adding an -mtune=ultra-size to make this testable rather than
changing the C906.
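
To sketch the idea (the name and the cost values are placeholders; the
costs could just be copied from the existing optimize_size_tune_info):

  static const struct riscv_tune_param ultra_size_tune_info = {
    /* ... cost fields as in optimize_size_tune_info ... */
    false,  /* slow_unaligned_access */
    true,   /* overlap_op_by_pieces */
  };

plus a matching { "ultra-size", generic, &ultra_size_tune_info } entry in
the tuning table.  The v2 tests could then run against it along these
lines (sketch only; the scan pattern would need checking against real
output):

  /* { dg-do compile } */
  /* { dg-options "-O2 -mtune=ultra-size -march=rv64gc -mabi=lp64" } */

  void foo(char *dst)
  {
    __builtin_memset (dst, 0, 15);
  }

  /* { dg-final { scan-assembler-times "sd\tzero" 2 } } */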
Palmer Dabbelt Aug. 16, 2021, 5:09 p.m. UTC | #16
On Mon, 16 Aug 2021 09:29:16 PDT (-0700), Kito Cheng wrote:
>> > Could you submit a v3 patch, which is v1 with the overlap_op_by_pieces
>> > field, the testcase from v2, and a few more comments describing the field?
>> >
>> > And add an -mtune=ultra-size to make it testable without changing
>> > other behavior?
>> >
>> > Hi Palmer:
>> >
>> > Are you OK with that?
>>
>> I'm still not convinced on the performance: like Andrew and I pointed
>> out, this is a difficult case for pipelines of this flavor to handle.
>> Nobody here knows anything about this pipeline deeply enough to say
>> anything definitive, though, so this is really just a guess.
>
> So adding an extra field to indicate that should resolve it?
> I believe people should only set overlap_op_by_pieces
> to true if they are sure it has benefits.

My only issue there is that we'd have no way to turn it on, but see 
below...

>> As I'm not convinced this is an obvious performance win I'm not going to
>> merge it without a benchmark.  If you're convinced and want to merge it
>> that's fine, I don't really care about the performance of the C906 and
>> if someone complains we can always just revert it later.
>
> I suppose Christoph has tried this with their internal processor and it
> benefits performance, but that processor can't be open-sourced yet, so the
> v2 patch set uses the C906 to demo and test it, since that is the only
> processor with slow_unaligned_access == false.

Well, that's a very different discussion.  The C906 tuning model should 
be for the C906, not a proxy for some internal-only processor.  If the 
goal here is to allow this pass to be flipped on by an out-of-tree 
pipeline model then we can talk about it.

> I agree on the C906 part, we never know whether it's a benefit or not, so
> I propose adding an -mtune=ultra-size to make this testable rather than
> changing the C906.

That's essentially the same conclusion we came to last time this came 
up, except that we were calling it "-Oz" (because LLVM does).  I guess 
we never got around to having the broader GCC discussion about "-Oz".  IIRC
we had some other "-Oz" candidates we never got around to dealing with, 
but that was a while ago so I'm not sure if any of that panned out.
Andrew Pinski Aug. 16, 2021, 6:56 p.m. UTC | #17
On Mon, Aug 16, 2021 at 10:10 AM Palmer Dabbelt <palmer@dabbelt.com> wrote:
>
> On Mon, 16 Aug 2021 09:29:16 PDT (-0700), Kito Cheng wrote:
> >> > Could you submit a v3 patch, which is v1 with the overlap_op_by_pieces
> >> > field, the testcase from v2, and a few more comments describing the field?
> >> >
> >> > And add an -mtune=ultra-size to make it testable without changing
> >> > other behavior?
> >> >
> >> > Hi Palmer:
> >> >
> >> > Are you OK with that?
> >>
> >> I'm still not convinced on the performance: like Andrew and I pointed
> >> out, this is a difficult case for pipelines of this flavor to handle.
> >> Nobody here knows anything about this pipeline deeply enough to say
> >> anything definitive, though, so this is really just a guess.
> >
> > So adding an extra field to indicate that should resolve it?
> > I believe people should only set overlap_op_by_pieces
> > to true if they are sure it has benefits.
>
> My only issue there is that we'd have no way to turn it on, but see
> below...
>
> >> As I'm not convinced this is an obvious performance win I'm not going to
> >> merge it without a benchmark.  If you're convinced and want to merge it
> >> that's fine, I don't really care about the performance of the C906 and
> >> if someone complains we can always just revert it later.
> >
> > I suppose Christoph has tried this with their internal processor and it
> > benefits performance, but that processor can't be open-sourced yet, so the
> > v2 patch set uses the C906 to demo and test it, since that is the only
> > processor with slow_unaligned_access == false.
>
> Well, that's a very different discussion.  The C906 tuning model should
> be for the C906, not a proxy for some internal-only processor.  If the
> goal here is to allow this pass to be flipped on by an out-of-tree
> pipeline model then we can talk about it.
>
> > I agree on the C906 part, we never know whether it's a benefit or not, so
> > I propose adding an -mtune=ultra-size to make this testable rather than
> > changing the C906.
>
> That's essentially the same conclusion we came to last time this came
> up, except that we were calling it "-Oz" (because LLVM does).  I guess
> we never got around to having the broader GCC discussion about "-Oz".  IIRC
> we had some other "-Oz" candidates we never got around to dealing with,
> but that was a while ago so I'm not sure if any of that panned out.

-Oz was a bad idea that Apple came up with because GCC decided to start
emitting store multiple on PowerPC around 13 years ago.
I don't think we should repeat that mistake for GCC, and especially not for RISC-V.
If people want to optimize for size, they get the performance issues.

Thanks,
Andrew Pinski
Palmer Dabbelt Aug. 16, 2021, 8:01 p.m. UTC | #18
On Mon, 16 Aug 2021 11:56:05 PDT (-0700), pinskia@gmail.com wrote:
> On Mon, Aug 16, 2021 at 10:10 AM Palmer Dabbelt <palmer@dabbelt.com> wrote:
>>
>> On Mon, 16 Aug 2021 09:29:16 PDT (-0700), Kito Cheng wrote:
>> >> > Could you submit a v3 patch, which is v1 with the overlap_op_by_pieces
>> >> > field, the testcase from v2, and a few more comments describing the field?
>> >> >
>> >> > And add an -mtune=ultra-size to make it testable without changing
>> >> > other behavior?
>> >> >
>> >> > Hi Palmer:
>> >> >
>> >> > Are you OK with that?
>> >>
>> >> I'm still not convinced on the performance: like Andrew and I pointed
>> >> out, this is a difficult case for pipelines of this flavor to handle.
>> >> Nobody here knows anything about this pipeline deeply enough to say
>> >> anything definitive, though, so this is really just a guess.
>> >
>> > So adding an extra field to indicate that should resolve it?
>> > I believe people should only set overlap_op_by_pieces
>> > to true if they are sure it has benefits.
>>
>> My only issue there is that we'd have no way to turn it on, but see
>> below...
>>
>> >> As I'm not convinced this is an obvious performance win I'm not going to
>> >> merge it without a benchmark.  If you're convinced and want to merge it
>> >> that's fine, I don't really care about the performance of the C906 and
>> >> if someone complains we can always just revert it later.
>> >
>> > I suppose Christoph has tried this with their internal processor and it
>> > benefits performance, but that processor can't be open-sourced yet, so the
>> > v2 patch set uses the C906 to demo and test it, since that is the only
>> > processor with slow_unaligned_access == false.
>>
>> Well, that's a very different discussion.  The C906 tuning model should
>> be for the C906, not a proxy for some internal-only processor.  If the
>> goal here is to allow this pass to be flipped on by an out-of-tree
>> pipeline model then we can talk about it.
>>
>> > I agree on the C906 part, we never know whether it's a benefit or not, so
>> > I propose adding an -mtune=ultra-size to make this testable rather than
>> > changing the C906.
>>
>> That's essentially the same conclusion we came to last time this came
>> up, except that we were calling it "-Oz" (because LLVM does).  I guess
>> we never got around to having the broader GCC discussion about "-Oz".  IIRC
>> we had some other "-Oz" candidates we never got around to dealing with,
>> but that was a while ago so I'm not sure if any of that panned out.
>
> -Oz was a bad idea that Apple came up with because GCC decided to start
> emitting store multiple on PowerPC around 13 years ago.
> I don't think we should repeat that mistake for GCC, and especially not for RISC-V.
> If people want to optimize for size, they get the performance issues.

Makes sense.  Probably best to avoid adding the RISC-V-specific version
of this as well, then, as it's really just two sides of the same coin.

Sounds like we'll likely want to stop implementing -Os via a tuning on
RISC-V: that was a convenient way to do it when we didn't have any
conflicts between -O and -mtune, but assuming this will eventually land
that won't be valid any more.  That's a pretty mechanical process.
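
(For reference, the mechanics today boil down to one line in
riscv_option_override -- sketch from memory, the exact context may
differ:

  tune_param = optimize_size ? &optimize_size_tune_info
                             : cpu->tune_param;

i.e. -Os just swaps in a different tuning struct behind the user's
back.)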

It still leaves us with the question of what to do with this pass, which 
IMO really just depends on what the actual goal is here: if we're trying 
to optimize for the C906 then we should just wait for the benchmarks to 
demorstrate this is worth doing (though again, Kito, if you think this 
is good enough and want to flip this on I don't really care that much), 
but if we're trying to optimize for some other pipeline then we should 
really wait for that to show up.

I'm not going to speculate about what this new pipeline is, but if 
there's anything concrete announced about it then I'm happy to take a 
look.  Historically we've never been super strict about waiting for 
hardware before taking a pipeline model, but I do think we should have 
something, as just trying to support any hypothetical future hardware
will lead to insanity.  IMO we need to be extra explicit that we're 
willing to work with hardware vendors, as due to the nature of RISC-V 
that can get lost in translation, but there has to be some balance.
Vineet Gupta Nov. 2, 2021, 7:27 p.m. UTC | #19
On 7/22/21 6:29 AM, Kito Cheng via Gcc-patches wrote:
> Could you add a testcase? Otherwise LGTM.
> 
> Option: -O2 -mtune=thead-c906 -march=rv64gc -mabi=lp64
> void foo(char *dst){
>     __builtin_memset(dst, 0, 15);
> }
> 
> On Thu, Jul 22, 2021 at 8:53 PM Christoph Muellner via Gcc-patches
> <gcc-patches@gcc.gnu.org> wrote:
>>
>> This patch enables the overlap-by-pieces feature of the by-pieces
>> infrastructure for inlining builtins in case the target has set
>> riscv_slow_unaligned_access_p to false.
>>
>> To demonstrate the effect for targets with fast unaligned access,
>> the following code sequences are generated for a 15-byte memset-zero.
>>
>> Without overlap_op_by_pieces we get:
>>    8e:   00053023                sd      zero,0(a0)
>>    92:   00052423                sw      zero,8(a0)
>>    96:   00051623                sh      zero,12(a0)
>>    9a:   00050723                sb      zero,14(a0)

What do I need to do to generate even the non-optimized code above with
gcc 11 [1][2]? Despite -mno-strict-align and trying -mtune={rocket,
sifive-7-series}, I only get the fully unrolled version:

foo:
# memcpy-15.c:2:    __builtin_memset(dst, 0, 15);
	sb	zero,0(a0)	#, MEM <char[1:15]> [(void *)dst_2(D)]
	sb	zero,1(a0)	#, MEM <char[1:15]> [(void *)dst_2(D)]
	sb	zero,2(a0)	#, MEM <char[1:15]> [(void *)dst_2(D)]
	sb	zero,3(a0)	#, MEM <char[1:15]> [(void *)dst_2(D)]
	sb	zero,4(a0)	#, MEM <char[1:15]> [(void *)dst_2(D)]
	sb	zero,5(a0)	#, MEM <char[1:15]> [(void *)dst_2(D)]
	sb	zero,6(a0)	#, MEM <char[1:15]> [(void *)dst_2(D)]
	sb	zero,7(a0)	#, MEM <char[1:15]> [(void *)dst_2(D)]
	sb	zero,8(a0)	#, MEM <char[1:15]> [(void *)dst_2(D)]
	sb	zero,9(a0)	#, MEM <char[1:15]> [(void *)dst_2(D)]
	sb	zero,10(a0)	#, MEM <char[1:15]> [(void *)dst_2(D)]
	sb	zero,11(a0)	#, MEM <char[1:15]> [(void *)dst_2(D)]
	sb	zero,12(a0)	#, MEM <char[1:15]> [(void *)dst_2(D)]
	sb	zero,13(a0)	#, MEM <char[1:15]> [(void *)dst_2(D)]
	sb	zero,14(a0)	#, MEM <char[1:15]> [(void *)dst_2(D)]
	ret	
	.size	foo, .-foo
	.ident	"GCC: (GNU) 11.1.0"

[1] https://gcc.gnu.org/pipermail/gcc-patches/2021-October/581858.html
[2] https://github.com/kito-cheng/riscv-gcc/tree/riscv-gcc-11.1.0-zbabcs

Thx,
-Vineet

Christoph Müllner Nov. 2, 2021, 8:09 p.m. UTC | #20
On Tue, Nov 2, 2021 at 8:27 PM Vineet Gupta <vineetg@rivosinc.com> wrote:
>
> On 7/22/21 6:29 AM, Kito Cheng via Gcc-patches wrote:
> > Could you add a testcase? Otherwise LGTM.
> >
> > Option: -O2 -mtune=thead-c906 -march=rv64gc -mabi=lp64
> > void foo(char *dst){
> >     __builtin_memset(dst, 0, 15);
> > }
> >
> > On Thu, Jul 22, 2021 at 8:53 PM Christoph Muellner via Gcc-patches
> > <gcc-patches@gcc.gnu.org> wrote:
> >>
> >> This patch enables the overlap-by-pieces feature of the by-pieces
> >> infrastructure for inlining builtins in case the target has set
> >> riscv_slow_unaligned_access_p to false.
> >>
> >> To demonstrate the effect for targets with fast unaligned access,
> >> the following code sequences are generated for a 15-byte memset-zero.
> >>
> >> Without overlap_op_by_pieces we get:
> >>    8e:   00053023                sd      zero,0(a0)
> >>    92:   00052423                sw      zero,8(a0)
> >>    96:   00051623                sh      zero,12(a0)
> >>    9a:   00050723                sb      zero,14(a0)
>
> What do I need to do to generate even the non-optimized code above with
> gcc 11 [1][2]? Despite -mno-strict-align and trying -mtune={rocket,
> sifive-7-series}, I only get the fully unrolled version.

You need a tuning struct with slow_unaligned_access == false.
Both Rocket and SiFive 7 have slow_unaligned_access set to true.
On mainline you have thead-c906, which would work.
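
For illustration, the relevant difference is just the tail of the tuning
structs in riscv.c (cost fields elided here):

  static const struct riscv_tune_param thead_c906_tune_info = {
    /* ... cost fields ... */
    false,  /* slow_unaligned_access */
  };

while rocket_tune_info and sifive_7_tune_info set that field to true.
So Kito's command line from above (-O2 -mtune=thead-c906 -march=rv64gc
-mabi=lp64) should give you the sd/sw/sh/sb sequence.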

BR
Christoph

Vineet Gupta Nov. 2, 2021, 8:15 p.m. UTC | #21
On 11/2/21 1:09 PM, Christoph Müllner wrote:
>>>> Without overlap_op_by_pieces we get:
>>>>     8e:   00053023                sd      zero,0(a0)
>>>>     92:   00052423                sw      zero,8(a0)
>>>>     96:   00051623                sh      zero,12(a0)
>>>>     9a:   00050723                sb      zero,14(a0)
>> What do I need to do to generate even the non-optimized code above with
>> gcc 11 [1][2]? Despite -mno-strict-align and trying -mtune={rocket,
>> sifive-7-series}, I only get the fully unrolled version.
> You need a tuning struct with slow_unaligned_access == false.
> Both Rocket and SiFive 7 have slow_unaligned_access set to true.
> On mainline you have thead-c906, which would work.

But doesn't -mno-strict-align imply that?

Thx,
-Vineet
Christoph Müllner Nov. 2, 2021, 9:18 p.m. UTC | #22
On Tue, Nov 2, 2021 at 9:15 PM Vineet Gupta <vineetg@rivosinc.com> wrote:
>
>
>
> On 11/2/21 1:09 PM, Christoph Müllner wrote:
> >>>> Without overlap_op_by_pieces we get:
> >>>>     8e:   00053023                sd      zero,0(a0)
> >>>>     92:   00052423                sw      zero,8(a0)
> >>>>     96:   00051623                sh      zero,12(a0)
> >>>>     9a:   00050723                sb      zero,14(a0)
> >> What do I need to do to generate even the non-optimized code above with
> >> gcc 11 [1][2]? Despite -mno-strict-align and trying -mtune={rocket,
> >> sifive-7-series}, I only get the fully unrolled version.
> > You need a tuning struct with slow_unaligned_access == false.
> > Both Rocket and SiFive 7 have slow_unaligned_access set to true.
> > On mainline you have thead-c906, which would work.
>
> But doesn't -mno-strict-align imply that?

Opposite direction.
With `-mno-strict-align`, emitted code might contain unaligned accesses
if `slow_unaligned_access == false`.
Even if `slow_unaligned_access == false`, `-mstrict-align` will still
prevent unaligned accesses.
Usually, there is a good reason why `slow_unaligned_access` is set to
`true` (e.g. a significant penalty in case of unaligned accesses).
It wouldn't make sense to overrule this.
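
Summarized, given how riscv.c combines the tuning flag with
TARGET_STRICT_ALIGN:

  tuning slow_unaligned_access | -mstrict-align | unaligned access emitted?
  -----------------------------+----------------+--------------------------
  false                        | no             | possibly
  false                        | yes            | no
  true                         | no or yes      | no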


>
> Thx,
> -Vineet
Vineet Gupta Nov. 2, 2021, 10:04 p.m. UTC | #23
On 11/2/21 2:18 PM, Christoph Müllner wrote:
> On Tue, Nov 2, 2021 at 9:15 PM Vineet Gupta <vineetg@rivosinc.com> wrote:
>>
>>
>> On 11/2/21 1:09 PM, Christoph Müllner wrote:
>>>>>> Without overlap_op_by_pieces we get:
>>>>>>      8e:   00053023                sd      zero,0(a0)
>>>>>>      92:   00052423                sw      zero,8(a0)
>>>>>>      96:   00051623                sh      zero,12(a0)
>>>>>>      9a:   00050723                sb      zero,14(a0)
>>>> What do I need to do to generate even the non-optimized code above with
>>>> gcc 11 [1][2]? Despite -mno-strict-align and trying -mtune={rocket,
>>>> sifive-7-series}, I only get the fully unrolled version.
>>> You need a tuning struct with slow_unaligned_access == false.
>>> Both Rocket and SiFive 7 have slow_unaligned_access set to true.
>>> On mainline you have thead-c906, which would work.
>> But doesn't -mno-strict-align imply that?
> Opposite direction.

Took me a while to unpack :-)

> With `-mno-strict-align`, emitted code might contain unaligned accesses
> if `slow_unaligned_access == false`.
> Even if `slow_unaligned_access == false`, `-mstrict-align` will still
> prevent unaligned accesses.
> Usually, there is a good reason why `slow_unaligned_access` is set to
> `true` (e.g. a significant penalty in case of unaligned accesses).
> It wouldn't make sense to overrule this.

Sure, it makes sense, since this is fundamental to the uarch.
Because of the following snippet, unaligned-access codegen can only be
made more restrictive, not less (and experimenting really requires a
compiler rebuild):

   riscv_slow_unaligned_access_p = (cpu->tune_param->slow_unaligned_access
                    || TARGET_STRICT_ALIGN);

Thx,
-Vineet
diff mbox series

Patch

diff --git a/gcc/config/riscv/riscv.c b/gcc/config/riscv/riscv.c
index 576960bb37c..98c76ba657a 100644
--- a/gcc/config/riscv/riscv.c
+++ b/gcc/config/riscv/riscv.c
@@ -5201,6 +5201,14 @@  riscv_slow_unaligned_access (machine_mode, unsigned int)
   return riscv_slow_unaligned_access_p;
 }
 
+/* Implement TARGET_OVERLAP_OP_BY_PIECES_P.  */
+
+static bool
+riscv_overlap_op_by_pieces (void)
+{
+  return !riscv_slow_unaligned_access_p;
+}
+
 /* Implement TARGET_CAN_CHANGE_MODE_CLASS.  */
 
 static bool
@@ -5525,6 +5533,9 @@  riscv_asan_shadow_offset (void)
 #undef TARGET_SLOW_UNALIGNED_ACCESS
 #define TARGET_SLOW_UNALIGNED_ACCESS riscv_slow_unaligned_access
 
+#undef TARGET_OVERLAP_OP_BY_PIECES_P
+#define TARGET_OVERLAP_OP_BY_PIECES_P riscv_overlap_op_by_pieces
+
 #undef TARGET_SECONDARY_MEMORY_NEEDED
 #define TARGET_SECONDARY_MEMORY_NEEDED riscv_secondary_memory_needed